ABSTRACT

This report is aimed at helping Congress better 
understand the functions, history, capabilities, limitations, uses, 
and misuses of educational tests; learn more about the promises and
pitfalls of new assessment methods and technologies; and identify and
weigh policy options affecting educational policy. To prepare this
report, the Office of Technology Assessment (OTA) examined 
technological and institutional aspects of educational testing. This 
report synthesizes the OTA's findings and outlines options for
congressional action. The following chapters are included: (1)
"Summary and Policy Options"; (2) "Testing in Transition"; (3) 
"Educational Testing Policy: The Changing Federal Role"; (4) "Lessons 
from the Past: A History of Educational Testing in the United
States"; (5) "How Other Countries Test"; (6) "Standardized Tests in 
Schools: A Primer"; (7) "Performance Assessment: Methods and 
Characteristics"; and (8) "Information Technologies and Testing: 
Past, Present, Future." The OTA concludes that examining the 
capability of tests to meet specific objectives is necessary to 
resolve the conflict over testing in American schools. Issues now 
before the Congress that could fundamentally alter American testing 
are changes to the National Assessment of Educational Progress, 
proposals for national testing, and revisions to provisions for 
educationally disadvantaged children. Appendix A provides a 63-item 
list of acronyms, and Appendix B provides a 12-item bibliography of
related contractor reports. (SLD) 
Foreword



Education is a primary concern for our country, and testing is a primary tool of education. 
No other country tests its school children with the frequency and seriousness that characterizes
the United States. Once the province of classroom teachers, testing has also become an
instrument of State and Federal policy. Over the past decade in particular, the desire of the
Congress and State Legislatures to improve education and evaluate programs has substantially
intensified the amount and importance of testing.

Because of these developments and in light of current research on thinking and learning, 
Congress asked OTA to provide a comprehensive report on educational testing, with emphasis 
on new approaches. Changing technology and new understanding of thinking and learning 
offer avenues for testing in different ways. These new approaches are attractive, but inevitably 
carry some drawbacks. 

Too often, testing is treated narrowly, rather than as a flexible tool to obtain information
about important questions. In this report, OTA places testing in its historical and policy
context, examines the reasons for testing and the ways it is done, and identifies particular ways
Federal policy affects the picture. The report also explores new approaches to testing that
derive from modern technology and cognitive research.

The advisory panel, workshop participants, reviewers, and other contributors to this study 
were instrumental in defining the key issues and providing a range of perspectives on them.
OTA thanks them for their commitment of energy and sense of purpose. Their participation
does not necessarily represent endorsement of the contents of this report, for which OTA bears
sole responsibility. 
CHAPTER 1

Summary and Policy Options 



The American educational system is unique. 
Among the first in the world to establish a commit- 
ment to public elementary and secondary schooling 
for all children, it has achieved an extraordinary 
record: enrollment rates of school-age children in the 
United States are among the highest in the world, 
and over 80 percent finish high school in some form
between the ages of 18 and 24.¹ This tradition of
education for the masses was nurtured in a system 
that, by all outward appearances, is complex and 
fragmented: 40 million children enrolled in some 
83,000 schools scattered across some 15,000 school 
districts. Pluralism, diversity, and local control — 
hallmarks of American democracy — distinguish the 
American educational experiment from others in the 
world. 

Student testing has always played a pivotal role in 
this experiment. Every day millions of school 
children take tests. Most are devised by teachers to 
see how well their pupils are learning and to signal 
to pupils what they should be studying. Surprise 
quizzes, take-home written assignments, oral presentations,
pretests, retests, and end-of-year comprehensive
examinations are all in the teacher's toolbox.

It is another category of test, however — originating
outside the classroom, usually with standardized
rules for scoring and administration — that has garnered
the most attention, discussion, and controversy.
From the earliest days of the public school
movement, American educators, parents, policymakers,
and taxpayers have turned to these tests as
multipurpose tools: yardstick of individual progress
in classrooms, agent of school reform, filter of
educational opportunity, and barometer of the national
educational condition.

Commonly referred to as "standardized tests,"²
these instruments usually serve management functions;
they are intended to inform decisions made by
people other than the classroom teacher. They are
used to monitor the achievement of children in
school systems and guide decisions, such as students'
eligibility for special resources or their
qualification for admission to special school programs.
Children's scores on such tests are often
aggregated to describe the performance of classrooms,
schools, districts, or States. With technological
advances, these tests have become more reliable
and more precise, and their popularity has grown.
Today they are a fixture in American schools, as
common as books and classrooms; standardized test
results have become a major force in shaping public
attitudes about the quality of American schools and
the capabilities of American students.

Testing at a Crossroads

Tests designed and administered outside the
classroom are given less frequently than teacher-made
tests, but they are thoroughly entrenched in the
American school scene and their use has been on the
rise. One indicator of growth is sales of commercially
produced standardized tests. Revenues from
sales of tests used in elementary and secondary
schools more than doubled (in constant dollars)
between 1960 and 1989 (see figure 1-1), a period
during which student enrollments grew by only 15
percent.³ The rise in testing reflects a heightened
demand from legislators at all levels — and their 
constituents — for evidence that education dollars 



¹For current data comparing primary and secondary school enrollment rates in the United States and other countries, see U.S. Department of
Education, National Center for Education Statistics, Digest of Education Statistics, 1990 (Washington, DC: February 1991), p. 380; and George Madaus,
Boston College, and Thomas Kellaghan, St. Patrick's College, Dublin, "Student Examination Systems in the European Community: Lessons for the
United States," OTA contractor report, June 1991. For a thorough analysis of completion and dropout data, see U.S. Department of Education, National
Center for Education Statistics, Dropout Rates in the US: 1989 (Washington, DC: September 1990). With respect to postsecondary education, as well,
participation rates of American high school graduates are the highest in the world: close to 60 percent of persons of college-going age were enrolled in
postsecondary institutions in 1985, compared to 30 percent in France, Germany, and Japan, 21 percent in the United Kingdom, and 55 percent in Canada.
For details see Kenneth Redd and Wayne Riddle, Congressional Research Service, "Comparative Education: Statistics on Education in the U.S. and
Selected Foreign Nations," 88-764 EPW, Nov. 14, 1988.

²Testing terms have both technical and common meanings, and often cause confusion. Box 1-A is a glossary of words used in this report, and will
help the reader understand the precise meanings of these words.

³U.S. Department of Education, Digest of Education Statistics, 1990, op. cit., footnote 1, p. 12. The fact that testing grew proportionally more rapidly
than the student population suggests that policymakers may have responded to increased enrollments by attempting to institute greater administrative
efficiency in the schools. As discussed in ch. 4, this is a familiar historical trend.
Figure 1-1—Growth in Revenues From Test Sales and
in Public School Enrollments, 1960-89

[...] change is computed over 1960 base year (not over prior year level).

[...]mena Simora (ed.), The Bowker Annual (New York, NY: Reed
Publishing, 1970-1990). Enrollment data from U.S. Department
of Education, National Center for Education Statistics, Digest
of Education Statistics, 1990 (Washington, DC: February 1991),
p. 12.

are spent effectively. Holding schools and teachers
"accountable" has increasingly become synonymous
with increased standardized testing.

State and local governments have traditionally 
assumed the greatest share of elementary and 
secondary education funding, as shown in figure 1-2. 
State funding began to exceed local funding as a 
percentage of the total starting in the mid-1970s, and 
State-mandated testing grew accordingly; 46 States
had mandated testing programs in 1990 as compared
to 29 in 1980.⁴ Similarly, increases in Federal
education spending during the 1960s and 1970s
spurred increases in testing as Congress sought data
to evaluate Federal programs and monitor national
educational progress. The Federal Government currently
spends over $20 billion per year on elementary
and secondary education in programs administered
by over a dozen Federal agencies.⁵



Figure 1-2—Shifts in Federal, State, and Local
Funding Patterns for Public Elementary and
Secondary Schools, Selected Years

[Figure: percentage of total revenues by school year, 1960 to 1988.]

SOURCE: U.S. Department of Education, National Center for Education
Statistics, Digest of Education Statistics, 1990 (Washington,
DC: February 1991).

Outcome-based measures of the effectiveness of 
educational programs — generally achievement test 
scores — have become key elements in the congres- 
sional appropriations and authorization process. 

Contradictory demands for reevaluation of testing
have been caught up in recent school reform 
initiatives. On the one hand, many teachers, admin- 
istrators, and others attempting to redesign curricula, 
reform instruction, and improve learning feel sty- 
mied by tests that do not accurately reflect new 
educational goals. On the other hand, most leading 
educational measurement experts emphasize that 
conventional standardized tests are useful tools in 
gauging the strengths, weaknesses, and progress of 
American students. 

Motivated in part by changing visions of classroom
learning and by frustration with tests that many
critics claim can hinder children's progress toward 
higher levels of achievement, many educators are 
turning to changed methods of testing. Some of these 
methods are modifications of conventional written 
tests; others are bolder innovations, requiring stu- 
⁵U.S. Department of Education, Digest of Education Statistics, 1990, op. cit., footnote 1, p. 337.
Box 1-A—A Glossary of Testing Terminology

A test score is an estimate. It is based on sampling what the test taker knows or can do. For example, by asking
a sample of questions (drawn from all the material that has been taught), a biology test is used to estimate how much
biology the student has learned. Tests can provide valuable information about an individual's competence,
knowledge, skills, or behavior. Achievement tests are intended to estimate what a student knows and can do in a
specific subject as a result of schooling. Achievement tests and aptitude tests are both instruments that estimate
aspects of an individual's developed abilities; they exist on a continuum, with the former being more closely tied
to specific curricula and school programs and the latter intended to capture knowledge acquired both in and out of
school.

Standardized tests are administered and scored under conditions uniform to all students. Although most people
associate standardized tests with the multiple-choice format, it is important to emphasize that standardization is a
generic concept that can apply to any testing format—from written essays to oral examinations to producing a
portfolio. Standardization is needed to make test scores comparable and to assure as much as possible that test takers
have equal chances to demonstrate what they know.

The word standards applied to tests has at least two different meanings. In the more general context it denotes
goals, desirable behaviors, or models to which students, teachers, or schools should aspire. Such standards describe
what optimal performance looks like and what is desirable for students to know. For example, the National Council
of Teachers of Mathematics has determined that a standard for mathematics instruction is to emphasize mathematics
as problem solving. The word standards, in its more technical meaning, denotes the specific levels of proficiency
that students are expected to attain. Thus, setting a passing score for a test is equivalent to setting a standard of
performance on that test.

Because they are based on samples of behavior, tests are necessarily imprecise: scores can vary for reasons
unrelated to the individual's actual achievement. Test scores can only describe what skills have been mastered, but
they cannot, alone, explain why learning has occurred, or prescribe ways to improve it. The fact that achievement
is affected by schools, parents, home background, and other factors constrains the inferences [...]
schools and programs. Test scores must be interpreted carefully.

Reliability refers to the consistency and generalizability of test data. Will a student's score today be close (if
not identical) to her score tomorrow? Do the questions covering a subset of skills generalize to the broader universe
of skills? If tests are scored by human judges, to what extent do different judges agree in their estimations of student
achievement? A test needs to demonstrate a high degree of reliability before it is used to make decisions, particularly
those with high stakes attached.

Validity refers to whether or not a test measures what it is supposed to measure, and whether appropriate
inferences can be drawn from test results. Validity is judged from many types of evidence, including, in the views
of some experts, the consequences of translating test-based inferences into decisions or policies that can affect
individuals or institutions. An acceptable level of validity must be demonstrated before a test is used to make decisions.

There are two basic ways of interpreting student performance on tests. One is to describe a student's test
performance as it compares to that of other students (e.g., he typed better than 90 percent of his classmates).
Norm-referenced tests are designed to make this type of comparison. The other method is to describe the skills or
performance that the student demonstrates (e.g., he typed 45 words per minute without errors). Criterion-referenced
tests are designed to compare a student's test performance to clearly defined learning tasks or skill levels.

Performance assessment refers to testing methods that require students to create an answer or product that
demonstrates their knowledge or skills. Performance assessment can take many different forms including writing
short answers, doing mathematical computations, writing an extended essay, conducting an experiment, presenting
an oral argument, or assembling a portfolio of representative work.

Constructed-response items are one kind of performance assessment consisting of open-ended written items
on a conventional test. However, they require students to produce the solution to a question rather than to select from
an array of possible answers (as multiple-choice items do).

Computer-administered testing is a generic term covering any test that is taken by a student seated at a
computer. A special type of computer-administered testing is computer-adaptive testing, which applies the
computer's memory and branching capabilities in order to adapt the test to the skill levels shown by the individual
test taker as the test is taken.

SOURCE: Office of Technology Assessment, 1992.
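
The two ways of interpreting a score described in the glossary can be illustrated with a short sketch. It is only an illustration under stated assumptions: the classmates' typing speeds, the 40 words-per-minute cut score, and the function names `percentile_rank` and `meets_criterion` are hypothetical and do not come from the report. The norm-referenced reading compares the student with peers; the criterion-referenced reading compares the student with a fixed standard.

```python
from bisect import bisect_left


def percentile_rank(score, peer_scores):
    """Norm-referenced reading: percent of peers scoring strictly below `score`."""
    ordered = sorted(peer_scores)
    below = bisect_left(ordered, score)  # count of peers with a lower score
    return 100.0 * below / len(ordered)


def meets_criterion(score, cut_score):
    """Criterion-referenced reading: has the defined skill level been reached?"""
    return score >= cut_score


# Hypothetical typing speeds (words per minute) for ten classmates.
classmates_wpm = [22, 25, 28, 30, 31, 33, 35, 38, 40, 52]
student_wpm = 45
cut_score_wpm = 40  # hypothetical passing standard, not taken from the report

rank = percentile_rank(student_wpm, classmates_wpm)
passed = meets_criterion(student_wpm, cut_score_wpm)
print(f"Typed faster than {rank:.0f} percent of classmates")
print(f"Meets the {cut_score_wpm} wpm standard: {passed}")
```

The same raw score supports both statements, but they answer different questions: the first would change if the comparison group changed, while the second would change only if the standard itself were reset.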

6 • Testing in American Schools: Asking the Right Questions



Photo credit: Bob Daemmrich

Most children in the United States take standardized achievement tests several times during elementary and secondary school.
Standardized test results have become a major force in shaping public attitudes about the quality of American schools
and the capabilities of American students.



dents to demonstrate their knowledge and skills 
through methods known as "performance assessment."
Computer technologies, video, and integrated
multimedia systems add capabilities and
richness not usually attainable from conventional
tests, and are gaining ground in assessment as well
as instruction.

These new approaches to testing have been fueled 
by some cognitive scientists who claim that complex 
thinking involves processes not easily reduced to the 
routinized tasks required on conventional tests. A 



recent report on science education, for example, 
argued that: 

Rather than mastering concepts, students believe that
recognizing terms in a multiple-choice format is the
appropriate educational goal. In the long run the
impact of current modes of testing on enduring skills
and strategies for learning will be inimical to reform.⁶

In contrast, many testing professionals maintain 
that school improvement efforts must be constructed 
on a solid foundation of information about what 



⁶National Research Council, Fulfilling the Promise: Biology Education in the Nation's Schools (Washington, DC: 1990), p. 44. Another recent report
concluded that "...to direct testing down a more constructive course, we must draw on [...]
sources beyond multiple choice tests." See National Commission on Testing and Public Policy, From Gatekeeper to Gateway: Transforming Testing
in America (Chestnut Hill, MA: Boston College, 1990), p. xi; also Walter Haney and George Madaus, "Searching for Alternatives to Standardized Tests:
Whys, Whats, and Whithers," Phi Delta Kappan, vol. 70, No. 9, May 1989.
students are learning; well-designed tests, they say,
if used and interpreted properly, can provide invaluable
information in a reliable, consistent, and
efficient fashion. For example, standardized tests
can inform policymakers by supplying trend data on
the skill levels of American students. Recent analysis
of data from the Iowa Tests of Basic Skills
revealed that student performance improved between
1979 and 1985, even on test items designed to
assess certain higher order skills, contradicting
findings from other test data that improvements were
limited to mechanical tasks.⁷

Measurement experts contend that these standardized
tests are also useful to teachers, as tools to
calibrate classroom impressions of student progress;
they are viewed as one relatively efficient, albeit
inexact, indicator of how a given child or school
system is progressing relative to students nationwide.
One test author expressed a view shared by
many others in the testing community:

... comprehensive, survey-type standardized achievement
tests have served a useful function in monitoring
the achievement levels of individual pupils and
the aggregate groupings of these students in terms of
classrooms, buildings, and the district.⁸

Common Ground 

To outsiders listening in on this debate, it may
appear that proponents of conventional and new
forms of assessment are adversaries locked in an
intractable stalemate. Closer inspection, however,
reveals that testing policy is not a zero-sum game in
which either existing testing or new methods win,
but an arena with multiple and mutually compatible
choices.

The trick is using the kind of test that is best 
suited to providing the desired type of informa- 
tion. Thus, although some activists in the debate 
have carved out extreme positions, most others 
agree on at least these two fundamental points: 

• different forms of testing can, if used correctly,
enrich our understanding of student
achievement; and 



• tests of any kind should be used only to serve 
the functions for which they were designed 
and validated. 

On this common ground it may be possible to
build genuine reform. One prominent psychologist
and long-time participant in the politics and science
of testing, commenting on what appears to be a rare
opportunity, observed that: "... our testing ecology
is entirely manmade; what we made we can
change."⁹

Lessons of History 

But history tempers the optimism. Since the birth 
of mass public education in America some 150 years 
ago, innovation in tests and testing has been most
attractive during periods of heightened public anxiety
about the state of the schools. During these
periods, however, legislators and school officials
feel the greatest pressure to act, and are most prone
to rely on existing tests as levers of policy. Thus,
researchers and policymakers involved in the painstaking
process of curricular reform and new test
design often find themselves at odds with those who 
demand quicker and more immediately noticeable 
action. Hence (as described in detail in ch. 4), tests 
have too often been used to serve functions for which 
they were not designed or adequately validated. 
Within the education policy and research commu- 
nity, therefore, there is an undercurrent of concern 
that new tests will, as in the past, be implemented 
before they have been validated and before their 
effects on learning can be understood. 

For some educators the principal concern is that 
new tests will raise new barriers — to women, people 
of color, other minorities, and the economically
disadvantaged. On these issues, too, caution flags 
are up: precisely because testing has historically 
been viewed as a means to achieve educational 
equity, tests themselves have always been scruti- 
nized on the question of whether they do more to 
alleviate or exacerbate social, economic, and educa- 
tional disparities (see box 1-B). 



⁷See Elizabeth Witt, Myunghee Han, and H.D. Hoover, "Recent Trends in Achievement Test Scores: Which Students are Improving and on What
Levels of Skill Complexity?" paper presented at the annual meeting of the National Council on Measurement in Education, Boston, MA, 1990. See also
Robert Linn and Stephen Dunbar, "The Nation's Report Card Goes Home: Good News and Bad About Trends in Achievement," Phi Delta Kappan,
vol. 72, No. 2, October 1990, p. 132. For a thorough analysis of trends in achievement that illustrates the importance of using multiple measures of
performance, see Daniel Koretz, Trends in Educational Achievement (Washington, DC: Congressional Budget Office, 1986).

⁸Herbert Rudman, "The Future of Testing is Now," Educational Measurement: Issues and Practice, vol. 6, No. 3, fall 1987, p. 6.

⁹Sheldon White, professor of psychology, Harvard University, personal communication, June 1991.
Box 1-B—Equity, Fairness, and Educational Testing

Steven Jay Gould's seminal treatise on the history of intelligence testing is dedicated to "... the memory of
Grammy and Papa Joe, who came, struggled, and prospered, Mr. Goddard notwithstanding."¹ From his very first
page, then, Gould telegraphs the deeply emotional chords struck by concepts of psy[...]
testing. As Gould explains midway through the book, Goddard had been one of a handful of prominent American
psychologists who used test data to advance racist, xenophobic, and eugenicist ideologies. Although Goddard
himself later recanted,² in one of the more impressive turnarounds in the history of science, the atmosphere of the
1920s and 1930s gave tests "... the rather happy property of being a conservative social innovation. They could
be perceived as justifying the richness of the rich and the poverty of the poor; they legitimized the existing social
order."³

The historical misuse of intelligence tests and their achievement test cousins—to bolster support for restrictive
immigration laws, to limit college admissions, and to label children as uneducable—has left an indelible stain on
the science of mental measurement.⁴ It is no wonder that testing policy arouses the passions of Americans
concerned with equal opportunity and social mobility. As in the past, these passions run [...]
may agree that testing can be a wedge, but some see the wedge forcing open the gates of opportunity while others
see it as the doorstop keeping the gates tightly shut.

Consider, for example, the following excerpts, both from individuals deeply concerned with opportunities for
minority and disadvantaged children:

Minority youngsters who ... are disproportionately among the poor, tend to be relegated to poor schools, or
[...] out of academic courses, just as young women are [...]
difference in the group scores [on the Scholastic Aptitude Test] ... represent anything but "bias." Rather, the
score is a faithful messenger of the unequal distribution in our country of educational resources and [...]

Test makers claim that the lower test scores of racial and ethnic minorities [...]
simply reflect the biases and inequities that exist in American schools and American [...]
certainly exist—but standardized tests do not merely reflect their impact; they compound [...]



¹Steven Jay Gould, The Mismeasure of Man (New York, NY: Norton, 1981), dedication, p. 7.

²See, e.g., Carl Degler, In Search of Human Nature (London, England: Oxford University Press, 1991).

[...] lower classes and immigrants ... (and [...]
[...] or Richard Herrnstein, "IQ," Atlantic Monthly, vol. 228, September 1971, pp. 43-64.
[...] details on the history of ach[...]
[...]
American Education," [...]



The Purpose of This Report 

Federal policymakers are caught in an unenviable 
dilemma. On the one hand they must satisfy the 
growing demand for accountability, which is often 
expressed in terms of simple questions: Do the 
schools work? Are students learning? On the other 
hand, they must also be responsive to growing 
disaffection with the quality of data on which 



administrators rely for evaluations of programs: 
achievement scores are rough indicators, at best, of 
progress in attaining the many goals of federally 
funded programs. Not surprisingly, Federal evaluation
requirements that place additional testing burdens
on grantees and program participants often spur
an interest in revising those very requirements.¹⁰ As
the Federal Government has become a more prominent
player in elementary and secondary education,



¹⁰For example, the Department of Education recently formed a task force to look into problems of testing and evaluation for the Chapter 1/Title I
compensatory education program. See ch. 3 of this report.
These excerpts make clear the ne[...]
that tests can be used to identify inequalities in educational opportunities.⁷ But the question becomes how to use
that information. Advocates of testing as a "gatekeeper" argue that ability and achievement, rather than family
background, class, or the specific advantages that [...]
the distribution of opportunities and rewards in society. Moreover, they add, this system of distribution creates
incentives for school systems to provide their students with the b[...]

On the other hand, opponents contend that ability and achievement scores are highly correlated with
socioeconomic background factors⁸ and with the quality of schooling children received;⁹ under these circumstances,
no assessment can be considered equitable for stu[...]
material upon which the assessment is based."¹⁰

This debate will not be resolved easily or quickly; nor will it become moot with the advent of alternative
methods of assessment. On the contrary, it could very well become even more heated [...]
testing policy in the United States is at a crossroads, and if history [...]
depend in large part on basic issues of equity, fairness [...]
disadvantaged. The core questions are well summarized in a recent book on science assessment:

Are we better off with the flawed system now in place or with an [...]
even greater problems? What disparities in opportunity to [...]
students, teachers, or parents do something different to promote learning [...]
to the neediest students or providing summer instruction for student[...]
And does better assessment increase our responsibility to [...]
the demand and the ethical dilemmas we face in determining [...]
to do more, once we know more, perhaps the dangers of inequity poss[...]
absent the resolve to intervene, one could argue that assessment becomes [...]



⁷For discussion of test bias and the effects of testing on minority students, see, e.g., Walter Haney, Boston College, "Testing and
Minorities," draft monograph, January 1991, p. 24.

⁸See, e.g., Christopher Jencks et al., Inequality [...] (New York, NY: Basic Books, 1972).

⁹See, e.g., Ronald Ferguson, "Paying for Public Education: New Evidence on How and Why Money Matters," Harvard Journal on
Legislation, vol. 28, No. 2, summer 1991, pp. 465-498.

¹⁰Shirley Malcom, "Equity and Excellence Through [...]
Gerald Kulm and Shirley Malcom (eds.) (Washington, [...]
to note that standardized test scores, viewed by some [...]
major public programs to help minority and dis[...]
of minority and inner city children [...]
and Secondary Education Act of 1965 ... (Haney, op. cit., footnote 7, p. 22.)

¹¹Some minority educators, for example, fear that new assessment methods will [...] opportunities for minority students who have
recently begun to do better on conventional tests. There is also uncertainty over whether or not tests should be used for placing children in
remedial programs. Parents in California [...]
followed the precedent set in the landmark [...] v. [...]
children in remedial tracks. [...]

¹²Malcom, op. cit., footnote 10, p. 320.



and as the public's attitudes toward concepts of 
national educational goals and standards have evolved, 
Congress has become more involved in the testing 
debate.¹¹

Congress has a stake in U.S. testing policy for 
three main reasons: 



• to ensure that accurate and reliable data about 
American educational achievement are pro- 
vided to lawmakers, program administrators, 
parents, teachers, test takers, and the general 
public; 

• to ensure that the tests used to evaluate Federal 
education programs do not, in themselves,



¹¹A 1989 Gallup poll found that the majority of respondents supported the idea of national achievement standards and goals, but few supported either
State or Federal intervention in the definition of those standards and goals. For discussion see George Madaus, Boston College, and Thomas Kellaghan,
St. Patrick's College, Dublin, "Examination Systems in the European Community: Implications for a National Examination System in the United States,"
OTA contractor report, April 1991.
impede progress toward program goals; and

• to ensure that tests are used fairly and do not 
infringe on individual rights or impose unac- 
ceptable social costs. 

Congress faces a variety of decisions that could
have significant and long-term effects on the scope,
quantity, and quality of testing in the United States.
Issues related to national testing and the role of tests 
in Federal education programs are already on the 
congressional agenda; issues regarding the rights of 
test takers may emerge, as they have in previous 
times, if new national and State tests are mandated 
or if the stakes attached to existing tests are raised. 

This report is aimed at helping Congress: 

• better understand the functions, history, capa- 
bilities, limitations, uses, and misuses of educa- 
tional tests; 

• learn more about the promises and pitfalls of 
new assessment methods and technologies; and 

• identify and weigh policy options affecting 
educational testing. 

To unravel the complexities of these topics, OTA
examined technological and institutional aspects of
educational testing. This summary and policy chapter
synthesizes OTA's findings on tests and testing,
and outlines options for congressional action. Chap- 
ter 2 examines recent changes in the uses of testing 
as an instrument of policy, chapter 3 covers current 
issues affecting the role of the Federal Government 
in educational testing, chapter 4 reviews the history
of testing in the United States, and chapter 5 
considers lessons from testing in selected European 
and Asian countries. The final three chapters focus 
on the tests themselves. Chapter 6 explains charac- 
teristics and purposes of existing educational tests, 
and examines the reasons new test designs seem
warranted. Chapter 7 explores various approaches to 
performance assessment and how these methods are 
being implemented in schools, and chapter 8 exam- 
ines the current and future roles of computers and 
other information technologies in assessment. 

In this report, the analysis and discussion are 
framed in terms of the functions of testing. OTA 
concludes that examining the capability of various 
tests to meet specific objectives is the necessary first 
step in abating the seemingly endless controversy 



over the quantity and format of testing in American 
schools, and in laying the groundwork for new 
approaches. 

The Functions of Testing 

Educational tests have traditionally served many 
purposes that can be grouped into three basic 
functions: 

• to aid teachers and students in the conduct of 
classroom learning; 

• to monitor systemwide educational outcomes; 
and 

• to inform decisions about the selection, place- 
ment, and credentialing of individual students. 

These three functions have a common feature:
they provide information to support decisionmak- 
ing. However, they differ in the kinds of information 
they seek and the types of decisions they can 
support, and test results appropriate for some deci- 
sions may be inappropriate for others. 

Classroom Feedback for Students 
and Teachers 

Teachers must constantly adapt to the behaviors,
learning styles, and progress of the students in their
classrooms.12 Tests can help them organize and
process the steady stream of data arising from 
classroom interactions. Just as physicians use body 
temperature, blood pressure, heart rate, x rays, and 
other data to form an image of the patient's health 
and to determine appropriate treatments, teachers 
can use data of various types to better manage their
classes and, in some circumstances, to tailor lessons 
to the specific needs of individual students. Students 
can use information to gain sharper understanding of 
their strengths and weaknesses in different subjects 
and can adjust their study time accordingly. 

Tests that can aid classroom instruction and
learning need to: 

• provide detailed information about specific 
skills, rather than global or general scores; 

• be linked to content that is taught in the
classroom; 

• be administered frequently; 

• give feedback to students and teachers as 
quickly as possible; 



12For a recent analysis of the internal workings of classrooms and implications for education policy, see Edward Pauly, The Classroom Crucible: What
Really Works, What Doesn't, and Why (New York, NY: Basic Books, 1991), especially ch. 4.
A student in 1943 takes her oral spelling examination after
completing a written examination on the blackboard.

Teachers have always used a variety of tests to help them
manage their classes and evaluate student progress.

• be scored or graded to help students learn from
their errors and misunderstandings, and help
teachers intervene when students get stuck; and

• be based on clear and open criteria for scoring 
so that students know what to study and how 
they are being evaluated. 

System Monitoring 

How well is a school or school system perform- 
ing? This is a question often posed from the outside, 
by parents, legislators, and others with particularly 
high stakes in the answer. As shown in chapters 2 
and 4, the question is usually posed with more 
urgency when the impression is that the answer will 
be "not very well."

Educational tests of various sorts have long been 
viewed as objective instruments capable of provid- 



ing systematic and informed answers about the 
learning that takes place in schools. In an educa- 
tional system as decentralized and diverse as the 
American one, there is a nearly insatiable appetite
for evidence that all schools are providing children 
with a decent education. Since the mid-19th century,
tests have been used to determine how much 
students in different schools or school districts were 
learning. Recent increases in Federal expenditures 
have stimulated new demands for system accounta- 
bility. 

Test scores alone cannot reveal how or why
learning has occurred, or the degree to which
schools, parents, the child's home background, or
other factors have affected learning. When combined
appropriately with other data, however, such
as prior test results and children's socioeconomic
status, test results can help explain, as well as
describe, the outcomes of schooling.13

For tests to yield meaningful comparisons across 
schools and districts, they must: 

• be uniformly and impartially administered and 
scored; and 

• meet reasonable standards of consistency, fairness,
and validity.

In addition, to be useful system monitoring tools, 
these tests: 

• should provide general information about 
achievement, rather than detailed information 
on specific skills; 

• should describe the performance of groups of 
students — classrooms, schools, districts, or 
States — rather than individuals (thereby allow- 
ing the use of sampling methods that yield the 
desired information without the costly testing
of every student); and 

• can be administered infrequently (once or twice
a year at the most). 
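
The point about sampling can be illustrated with a minimal sketch, offered only as an illustration and not as a description of any testing program's actual design. It assumes a simple random sample drawn from a made-up population of scores; the function name, the score scale, the district size, and the sample size are all hypothetical.

```python
import random
import statistics


def estimate_mean_from_sample(scores, sample_size, seed=0):
    """Estimate a group's mean score from a simple random sample.

    Returns (sample_mean, standard_error). Hypothetical helper for
    illustration; real assessments use more elaborate sampling designs.
    """
    if not 2 <= sample_size <= len(scores):
        raise ValueError("sample_size must be between 2 and len(scores)")
    rng = random.Random(seed)
    sample = rng.sample(scores, sample_size)
    mean = statistics.fmean(sample)
    # Standard error of the mean, with a finite population correction
    # because the sample is drawn without replacement.
    fpc = ((len(scores) - sample_size) / (len(scores) - 1)) ** 0.5
    standard_error = statistics.stdev(sample) / sample_size ** 0.5 * fpc
    return mean, standard_error


# Made-up district of 10,000 students; the score scale is arbitrary.
_rng = random.Random(42)
population = [_rng.gauss(250, 35) for _ in range(10_000)]

# Test only 500 students (5 percent) instead of all 10,000.
sample_mean, standard_error = estimate_mean_from_sample(population, 500)
true_mean = statistics.fmean(population)
```

Under these assumptions the sample of 500 yields a group average with a standard error that is small relative to the spread of individual scores, which is why group-level monitoring does not require testing every student. The same sample says nothing reliable about any individual student.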

Selection, Placement, Credentialing14

13For example, recent analysis of data from close to 1,000 school districts in Texas found significant differences in student achievement scores that
could be explained by variations in measures of teacher quality and other inputs. See Ronald Ferguson, "Paying for Public Education: New Evidence
on How and Why Money Matters," Harvard Journal on Legislation, vol. 28, No. 2, summer 1991, pp. 465-498; and Richard Murnane, "Interpreting
the Evidence on 'Does Money Matter?'" Harvard Journal on Legislation, vol. 28, No. 2, summer 1991, pp. 437-464.

14These three terms overlap. However, selection refers primarily to decisions about a student's qualifications for admission to schools; placement refers
to decisions about qualifications of students to participate in programs within schools they attend; and credentialing (or certification) refers to decisions
regarding proficiencies reached by students who have participated in programs or completed courses of study.

Tests designed to provide data about individual
students' current achievement or predicted performance
can be used for individual selection, placement,
or credentialing decisions. This function of testing
has a long historical tradition: the earliest recorded
examples are Chinese civil service qualifying tests
given in the 2nd century B.C. As discussed in greater
detail in chapter 5, many European and Asian
countries continue to use examinations primarily for
professional and educational "gatekeeping" functions,
such as certifying students as qualified to
attend specialized or elite public education pro- 
grams. 

Placement and certification decisions are still 
quite commonly based on tests, even in elementary 
and secondary education. Minimum competency 
examinations are required in many States for high 
school graduation, for promotion from one grade to 
the next, or for placement in remedial or gifted
programs;15 Advanced Placement examinations are
used to determine whether high school students will
be given college credit and placed in advanced
courses when they arrive at college; and the National
Teacher's Examination is necessary for teacher
licensing in 35 States. 

In the United States, however, the use of tests for 
selective admissions decisions has been more limited
than in most other countries.16 It is rather at the
end of high school, when students compete for
admission to colleges and universities, that selection
tests play a critical role.17

Some recent proposals to initiate new tests at the 
national level include provisions for placement and 
certification. One such proposal calls for a "certificate
of initial mastery," to be issued to graduating



high school students who perform at prescribed 
levels on the test, and for examinations as certifica- 
tion criteria for completion of fourth and eighth 
grades.18

In contrast with tests used for system monitoring, 
tests used for selection, placement, or certification 
decisions must: 

• provide individual student scores; 

• meet particularly high standards of comparabil- 
ity, consistency, fairness, and validity; 

• provide information that is demonstrably rele- 
vant to successful performance in future school 
or work situations (in the case of selection 
tests); and 

• provide information that is demonstrably rele- 
vant to the identification of children with 
special needs (in the case of placement tests 
used for gifted and talented programs, remedial
education, or other special K-12 situations). 

These tests are similar to system monitoring tests 
with respect to the need for impartial scoring, 
standardized administration, generality of information,
and frequency of testing.

Some proposals for a new national test or system 
of examinations have selection or certification as a 
principal function. Good tests for these purposes 
must undergo intensive and time-consuming devel- 
opment as well as careful empirical evaluation. They 
must be carefully and clearly validated for these 
intended purposes. Historically, tests used for these 
purposes have been the most subject to legal
challenges and scrutiny (see chs. 2 and 4). 



15There is widespread concern about tests being used as the principal basis for placement of children into special programs, such as "gifted and
talented" or remedial. "A major problem is getting students who obviously need it into either gifted or remedial programs when they do not meet the
'required' minimum or maximum score on the tests [to qualify for State funding]," said Jack Webber, a sixth grade teacher in Redmond, WA (personal
communication, September 1991). Precise data on the numbers of schools or districts that rely on tests for these purposes, and on exactly how test data
enter into those decisions, are difficult to find. Recently the New York State Commissioner of Education struck down the use of achievement tests as
the sole screening criteria for placement of students in "enriched" programs. See also discussion in ch. 2.

16The situation has changed since the turn of the century, when, e.g., ". . . a student could not be admitted to Central [High School] without
demonstrating academic competence on an entrance exam. . . ." See David Labaree, The Making of an American High School: The Credentials Market
and the Central High School of Philadelphia, 1838-1939 (New Haven, CT: Yale University Press, 1988), p. 50. This was not a phenomenon limited
to the East Coast: rural students in Michigan and elsewhere in the Midwest needed to pass entrance examinations to gain admissions into urban high
schools. Since that time, however, policies of selective admissions into public high schools have disappeared in all but a handful of special institutions,
such as the Bronx High School of Science in New York.

17Over 3,000 colleges and universities use the Scholastic Aptitude Test (SAT) or American College Test (ACT) to aid in their selection from vast
numbers of applicants, and recruits take the Armed Services Vocational Aptitude Battery (ASVAB) for placement within the military. Many private
elementary and secondary schools use tests as a criterion for admission.

18For a summary of national testing proposals as of early 1991, see James Stedman, Congressional Research Service, "Selected National Organizations
Concerned With Educational Testing Policy," memorandum, Feb. 8, 1991. For a more recent update and discussion of the central issues, see "National
Testing: An Overview," Youth Policy, vol. 13, Nos. 4-5, special issue, September 1991, pp. 29-35. For a critique of these proposals see also Madaus
and Kellaghan, op. cit., footnote 11.
The United States ranks high in the world in terms of the percentage of the population graduating from high school. These students
were photographed during their 1991 graduation ceremony at Woodrow Wilson High School, a large public high school in the
District of Columbia. During the 1970s and 1980s many States instituted minimum competency testing
as a criterion for graduation.



Raising the Stakes 

In theory, educational tests are unobtrusive instruments
of estimation. A major sticking point in any
discussion of testing, however, is whether, in
practice, testing affects the behavior it is intended to
measure. In the current debate, advocates of new
ways to test often argue that since tests can play a
powerful role in influencing learning, they must be
designed to support desired educational goals. These
advocates disparage "teaching to the test" when a
test calls for isolated facts from a multiple-choice
format, but endorse the concept when the test
consists of "authentic" tasks. For these educators,
one of the main criteria for a "good" test is whether
it consists of tasks that students should practice.

More traditional measurement theorists, on the 
other hand, are skeptical about the value of teaching
to the test because of the need to obtain valid and
reliable information about the whole domain of
knowledge, not just the sample of tasks that appears
on the test. Thus, they argue that, regardless of a 
test's format, test scores are meaningless if students 
have practiced the tasks. 

The core of the often shrill debate reflects 
positions on two central questions: 




• Do conventional standardized tests designed to 
estimate student achievement negatively influ- 
ence instruction and learning? 

• Do new testing methods designed to guide 
instruction and learning accurately estimate
student achievement? 

Tests and Consequences 

As the Nation's use of standardized tests has 
increased, the consequences attached to test results 
have become more serious. All but four States have 
standardized testing programs. Test scores are applied
to a wide array of decisions affecting individ-
ual children, schools, and school systems. Students 
who have taken college entrance examinations, high 
school juniors who have failed State minimum 
competency tests, schools that have become lures in 
real estate advertisements, and States that have 
found themselves ranked in the national media by 
their average test scores are likely to remember the 
event, and its consequences, long afterwards.

Many educators, extrapolating from their experiences
in classrooms as students or as teachers,
contend that tests influence students and teachers
only if they perceive that important consequences
are linked to test results.19 But a fundamental
problem arises when important consequences, or
high stakes, are attached to test results; and not
surprisingly, the increase in high-stakes testing over
the past two decades has brought a concomitant rise
in controversy. To understand the problems that can
arise from high-stakes testing it is useful to consider
a familiar medical metaphor.

Fever thermometers are used to measure body
temperature without influencing that temperature;
they provide information that could lead to treatment
of the underlying condition suspected of causing
the fever. Similarly, well-designed educational tests
can provide useful information to help students,
teachers, or even school systems. Teachers can use
tests to gauge their students' progress and decide
how to "treat" children who are not doing well;
students (in the upper grades especially) can review 
their test results to see whether they are learning the 
material and to determine how they might learn it 
more effectively; and State funding authorities can 
use information on the relative progress of students 
in different schools to develop responsive educa- 
tional strategies. Thus, the information from tests 
can be used to choose appropriate educational 
"treatments."

Suppose, however, that patients were punished for 
running a high fever (or rewarded for a low one), or 
that doctors were rewarded for bringing down their
patients' fever (or penalized if the fever remained
high). They could easily take actions (cold showers,
aspirin, a glass of cold beer) to "cure" the
symptom but not necessarily the underlying illness.
More comprehensive and appropriate treatment 
could be delayed or skipped. Just as temporary drops 
in body temperature could give misleading indica- 
tions of changes in health status, fluctuations in 
scores from high-stakes educational tests may not 
reflect genuine changes in achievement. When 
stakes are high, a heavy emphasis is sometimes
placed on specific test results, and especially on
increasing scores. The symptom (low test scores)
is treated without affecting the underlying condition
(low achievement).

An instructive lesson about the mixed effects of 
high-stakes testing comes from the minimum com- 
petency testing (MCT) movement of the 1970s and 
1980s (see box 1-C). As described also in greater 
detail in chapter 2, many State legislatures pegged 
promotion, placement, and graduation requirements 
to performance on criterion-referenced tests. The 
underlying rationale was that extrinsic rewards and 
sanctions would induce students to learn the relevant
material more diligently and heighten teachers' 
motivation to ensure that all students learned the 
basics before moving them ahead. It now appears 
that the use of these tests misled policymakers and
the public about the progress of students, and in 
many places hindered the implementation of genu- 
ine school reforms. 

More recent research seems to confirm that 
high-stakes testing can mislead policymakers.20
Complicating this picture, however, is other prelimi- 
nary research evidence suggesting that students may 
underperform on tests that bear no individual 
consequences at all.21 If such distortions are occurring,
they may be misleading policymakers and the
general public into believing the schools are in 
worse shape than they really are (and into blaming 
the school system for a long list of social and 
economic problems22). The fine-tuning knob that
could adjust tests to provide just the right degree of 
incentive to students — enough to elicit their best 
genuine performance — has not been invented. 

Test Use 

One of the most vexing problems in testing policy 
is how to prevent test misuse, principally the 



19See, for example, Lauren Resnick, professor, University of Pittsburgh, testimony before the U.S. Congress, Senate Committee on Labor and Human
Resources, Subcommittee on Education, Arts, and Humanities, Mar. 7, 1990.

20See, e.g., Daniel Koretz, Robert Linn, Stephen Dunbar, and Lorrie Shepard, "The Effects of High Stakes Testing on Achievement: Preliminary
Findings About Generalization Across Tests," paper presented at the annual meeting of the American Educational Research Association, Chicago, IL,
April 1991; and Thomas Haladyna, Susan B. Nolen, and Nancy S. Haas, "Raising Standardized Achievement Test Scores and the Origins of Test Score
Pollution," Educational Researcher, vol. 20, No. 5, June-July 1991.

21See, e.g., Steven Brown and Herbert Walberg, University of Illinois at Chicago, "Motivational Effects on Test Scores of Elementary School
Students," monograph, n.d.; and Paul Burke, "You Can Lead Adolescents to a Test But You Can't Make Them Try," OTA contractor report, Aug.
14, 1991.



22See, e.g., Clark Kerr, "Is Education Really All That Guilty?" Education Week, vol. 10, No. 3, Feb. 27, 1991, p. 30.
Box 1-C--The Minimum Competency Debate

The American public school system is often accused of being resistant to change. It is common to hear rhetoric
accusing classrooms of being virtually indistinguishable from those of 50 years ago. In fact, though, American
schools have been changing since the very inception of the common school in the early 19th century.1 One
historian and policy analyst, citing the multiple waves of reform of curriculum, instructional methods, and
classroom technology, argues that American schools are "awash with innovation."2 But he questions whether these
technological and institutional innovations affect the ". . . core technology of the enterprise: processes of teaching
and learning in classrooms and schools."3

The question of whether innovation is always a good thing for schools helps frame a discussion of minimum
competency testing (MCT), clearly an institutional innovation of major proportion. Its "key demand," as one
commentator has written, ". . . was that no student be given a high school diploma without first passing a test
showing that he could read everyday English and do simple arithmetic. From its beginnings
districts in the late 1970s (Denver's program actually began in 1962), MCT spread rapidly, with the biggest
expansion occurring between 1975 and 1979. By 1980, 29 States had implemented legislation that required students
to pass criterion-referenced examinations, and 8 more had such legislation pending.5 Some States used the
examinations to determine eligibility for remedial programs and promotions and some required them for graduation.
By 1985, growth in such programs had leveled off, although 33 States were still mandating statewide MCT; 11 of
these States required the test as a prerequisite for graduation.6

Although there is vehement debate about the effects of MCT (and of high-stakes testing in general), there is
general agreement on the origins of MCT. As one of its more ardent proponents has written:

... this movement ... was, in essence, a popular uprising ... demand[ed]

about the fact that millions of their children were graduating from high school without the competence to go to the

grocery store with a shopping list and come back with the right items

to change that, and convinced that a required exit test would produce the result they

1The transition of the school system from one servicing the elites to one aspiring to universal access is described in many histories of
American education. See Ira Katznelson and Margaret Weir, Schooling for All (New York, NY: Basic Books, 1985); David Tyack, The
One Best System: A History of American Urban Education (Cambridge, MA: Harvard University Press, 1975); Michael B. Katz, The Irony of

in Amrioan Education J87&!9S7(fkmYoa^m:Sbitt6B<^ >7 t 

^W«*«*Blinoie,"PawdoaofIiiiimilootoB^ 
Confetence on Wmdai a eiKal Qaertoaa of InnovaHoii, Ctovmion Ceatef. Bake Unhenl^. hby 1991. 

..^ ^ *e immatkio qneMiao in edncatloa See. e-g^ Rkhaid Nelicn and Richaid Muinane. 

Techniques are Tacit: The Case of Education," Journal of Economic Behavior and Organization, vol. 5,
1984, pp. 353-373; or Larry Cuban, Teachers and Machines: The Classroom Use of Technology Since 1920 (New York, NY: Teachers College
Press, 1986).

4Barbara Lerner, "Good News About American Education," Commentary, vol. 91, No. 3, March 1991, p. 21.

. ^i?*?*J!!J?*T* C'«»P«^ Statu, and Poiwdal." The Future ofTluting, Baibcni S. Flake and Joseph C. Witt 

(eds.) (Hillsdale, NJ: L. Erlbaum Associates, 1986), pp. 88-144.

6U.S. Congress, Office of Technology Assessment, "State Educational Testing Practices," background paper of the Science, Education,
and Transportation Program, December 1987.

_ ^. ^'^•<V <!it.,eoo(nole4.p.21.SeealioDoaglaaA.Archbald.Univerrity 

Madlion, 'AReliospectlveandanAnalyiiiof tbeltolea of Mandated Tbitlng in Education Refo^^^ 1990. 




application of a test to purposes for which it was not
designed.23 A familiar case of test misuse is the
ranking of State school systems on a "wall chart"
displaying average scores on the Scholastic Aptitude
Test (SAT) along with other data.24 Why was this a
case of test misuse? First, the SAT is designed to
rank applicants from diverse educational backgrounds
with respect to their likely individual
performance as college freshmen. It is designed
specifically to override differences in curricula,

23See also Burke, op. cit., footnote 21; Larry Cuban, "The Misuse of Tests in Education," OTA contractor report, Sept. 9, 1991; Robert L. Linn, "Test
Misuse: Why Is It So Prevalent?" OTA contractor report, September 1991; and Nelson L. Noggle, "The Misuses of Educational Achievement Tests for
Grades K-12: A Perspective," OTA contractor report, October 1991.

24The wall chart, now defunct, was initiated in 1984 by then Secretary of Education Terrel Bell.

As with every other range of testing in American education history,8 MCT was quickly shrouded in
controversy. Educators and measurement specialists warned against the quick-fix mentality that exit tests could
solve the problems stemming from a complex web of home, school, and societal decay; teachers lamented this new
intrusion in their classrooms; and minority advocates challenged the legal and ethical basis for what appeared to
be the latest obstacle to the educational and economic well-being of their children.

What have been the effects of MCT? The research community remains divided: there is consensus that
MCT influenced education, but disagreement over whether it influenced education for the better.

Challenged to show that MCT worked, its supporters like to point to trends in achievement test scores: the
apparent improvement in literacy and numeracy among students generally, the shrinking of the gap between white
and minority students, and the upturn in Scholastic Aptitude Test
had its most direct effects on high school juniors and seniors, proponents claim that
lower grades too, where students heard the message that they would need to work harder in order to be promoted
and eventually graduate. Thus, they credit MCT even with the upturn in standardized test scores in the elementary
grades.

Other analysts dismiss these conclusions. First, test scores went up even in States without MCT programs,
undermining the causal relation between MCT and achievement.9 Second, even in States with MCT where scores
did go up, the timing of these events raises important questions. A 1987 congressional study noted that:
of the increase in competency testing occurred . . . several years after the upturn in achievement first became
apparent in the lower grades."10 The report showed that achievement scores probably
fifth graders in 1975. Thus, unless one is willing to believe that tests can have virtually
achievement, the timing of the rise in scores cannot be attributed to MCT. Third, the change in SAT scores beginning
in 1979 reflects the general improvement in performance recorded by that cohort of test takers all through their
school years, and not the advent of MCT. As one analyst put it: ". . . the higher scores rolled through the grades
like a rippling wave as the elementary school children got older."11

Finally, what about the observed improvements in National Assessment of Educational Progress (NAEP)
scores? First, NAEP scores did rise in the 1970s and 1980s, but the rise actually began as early as the 1974
assessment, well before MCT was in operation in all but one or two States. Second, analysts point out that while
test performance among Black and Hispanic 17-year-olds improved markedly during the 1970s and 1980s, it would
be misleading to infer that the gap between white and Black students had disappeared: ". . . white students
constituted the great majority of students in the two highest categories [suggesting] that there is still a substantial


8See ch. 4.

9See Gerald Bracey, rejoinder to Barbara Lerner, Commentary, vol. 92, No. 2, August 1991, p. 10.

10Daniel Koretz, Educational Achievement: Explanations and Implications of Recent Trends (Washington, DC: Congressional Budget
Office, August 1987), p. 84.

11Bracey, op. cit., footnote 9.



instruction, and academic rigor that may exist in the
thousands of high schools from which applicants 
have graduated; by design, therefore, it does not 
measure a student's mastery of any given curricu- 
lum, and therefore should not be used to gauge a 
school's effectiveness at delivering its curriculum. 
Second, the SAT is taken only by about one-third of
all students nationwide (with considerable regional
variation), so it provides a very inadequate measure
of the quality of education offered to all the students
in a State.25
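
The second objection, that differing participation rates make State averages incomparable, can be made concrete with a small sketch. It rests on a deliberately strong and hypothetical assumption that only the highest-scoring students in each State take the test; the function name, the score range, and the participation rates are invented for illustration and are not drawn from any actual SAT data.

```python
def mean_of_top_fraction(scores, fraction):
    """Average score among test takers, assuming (hypothetically) that
    only the top `fraction` of students choose to take the test."""
    ranked = sorted(scores, reverse=True)
    takers = max(1, round(len(ranked) * fraction))
    return sum(ranked[:takers]) / takers


# Two made-up States whose students have identical score distributions.
scores = list(range(1, 101))

# State A: 10 percent of students take the test.
state_a_average = mean_of_top_fraction(scores, 0.10)

# State B: 60 percent of students take the test.
state_b_average = mean_of_top_fraction(scores, 0.60)
```

Under these assumptions State A's reported average is well above State B's even though the two student populations are identical, so a ranking by average score would reflect who took the test and not how well the schools performed.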

There is considerable professional agreement 
about a number of principles of good test develop- 
ment and appropriate test use. The primary vehicle 
for enforcing these principles is self-regulation by 



25For discussion of these and other problems in using the Scholastic Aptitude Test as an indicator of State educational programs, see Cuban, op. cit.,
footnote 23; and Harold Hodgkinson, "Schools Are Awful--Aren't They?" Education Week, vol. 11, No. 9, Oct. 30, 1991, p. A32.
gap between the reading proficiency of the average Black or Hispanic 17-year-old and the average white
17-year-old."12 Third, there is a widespread fear that with its emphasis on basic skills, MCT forced many schools
to cut back on instruction in so-called "higher order" skills.13

But the debate over the effects of MCT goes well beyond trends in test scores, which are always difficult to
attribute to any single policy or intervention. Proponents look at the test scores and see a glass half full:
a reform policy that worked for basic skills and could now be successfully applied toward the goal of teaching more
children higher order skills. By and large, though, there is considerable agreement that State-mandated testing, and
MCT in particular, had unintended effects on classroom behavior of teachers and students, and that these effects
should serve as a warning for any future anticipated uses of high-stakes tests.

For example, one study combined analysis of survey data and intensive interviews with teachers and school
administrators, and concluded that the testing reinforced an excessive emphasis on basic skills and stymied local
efforts to upgrade the content of education being delivered to all students.14 Other studies have bemoaned the
narrowing effect that MCT seems to have had on instructional strategies, content coverage, and course offerings.15
Still other studies focus on the potentially misleading information derived from high-stakes tests: recent research
suggests that improvements on high-stakes tests do not generalize well to other measures of achievement in the same
domain; and studies that focus in particular on teachers in districts with high-stakes testing conditions (such as
minimum competency tests, school evaluation tests, or externally developed course-end tests) demonstrate a
greater influence of testing on curriculum and instruction.16

In the end, then, there appears to be consensus that innovation in school testing policies can have profound
effects—the disagreement is over the desirability of those effects. Although some of the evidence is contradictory,
at times even confusing, one thing is clear: test-based accountability is no panacea. Specific proposals for tests
intended to catalyze school improvement must be scrutinized on their individual merits.



^Robert Linn and Stephen Dunbar, "The Nation's
Phi Delta Kappan, vol. 72, No. 2, October 1990, p. 130. For discussion

in Schooling: Are We Misreading the Findings?" Phi Delta Kappan, vol. 68, No. 6, February 1987, pp. 424-430.
^It should be noted, however, that the empirical data on

e.g., Elizabeth Witt, Myunghee Han, and H.D. Hoover,
What Levels of Skill Complexity?" paper presented
1990.

^D. Corbett and B. Wilson, "Unintended and Unwelcome
of the American Educational Research

review and discussion, see Archbald and Porter, op. cit., footnote 7.
^Daniel Koretz, Robert Linn, Stephen Dunbar, and Lorrie Shepard, "The Effects of High-Stakes Testing on Achievement: Preliminary
Findings About Generalization Across Tests," paper presented at the annual meeting of the American Educational Research Association,
Chicago, IL, April 1991, p. 20.

^Claire Rottenberg and Mary Lee Smith, "Unintended Effects
meeting of the American Educational Research Association, Boston,



test developers and other trained professionals.^^ 
Standards and codes developed by professional 
associations, critical reviews of tests, and individual 
professional codes of ethics all contribute to better 
testing. But, in general, few safeguards exist to 
prevent misuse and misinterpretation of scores,



especially once they reach the public domain. Many 
professionals in the testing community also believe 
the codes lack enforcement mechanisms. Moreover, 
there has recently been heightened concern among 
test authors and publishers that market forces may 
interfere with good testing practice. As one test 






^An example of self-regulation often cited in the testing community is a decision taken by the Educational Testing Service (ETS) concerning the
National Teachers Examination (NTE), which is designed to certify new teachers. When the Governor of Arkansas signed a bill in 1983 requiring teachers
to pass the test in order to keep their jobs, ETS President Gregory Anrig protested: "It is morally and educationally wrong to tell someone who has been
judged a satisfactory teacher for many years that passing a certain test on a certain day is necessary to keep his or her job." ETS announced it would
no longer sell the NTE to States or school boards that used it to determine the futures of practicing teachers. See Edward Fiske, "Test Misuse is Charged,"
New York Times, Nov. 29, 1983, p. C1; also David Owen, None of the Above (Boston, MA: Houghton Mifflin, 1985), pp. 243-260.
author has warned: ". . . new corporate managers
. . . [are] rushing to produce tests that will ostensibly
meet purposes for which the tests have never been
intended."^

New Testing Technologies 

Educators dedicated to the proposition that testing
can be an integral part of instruction and a tool for
assessing the full range of knowledge and skills have
given impetus to new efforts to expand the technolo-
gies, modes, formats, and content of testing. Test
developers and educators are experimenting with:

• performance assessment, a broad category of 
testing methods that require students to create 
answers or products tiiat demonstrate what they 
are learning, and 

• computer and video technologies for develop-
ing test items, administering tests, and structur-
ing whole new modes of content and format.

This section of the summary begins with an 
overview of the characteristics of these new ap- 
proaches to assessment, and then considers their 
potential role in advancing the three basic functions 
of testing. It is important to remember that: 

• new assessment methods alone cannot ensure
consensus on what children should learn or the
levels of skills children should acquire,

• curriculum goals and standards of student 
achievement need to be determined before 
appropriate assessment methods can be de- 
signed, and 

• new assessment methods alone do not necessar-
ily equip teachers with the skills necessary to
change instruction and achieve new curricular
goals.

Performance Assessment

The move toward new methods of student testing
has been motivated by new understandings of how
children learn as well as by changing views of
curriculum. These views of learning, which chal-
lenge traditional concepts of curricula and teaching,
also challenge existing methods of evaluating stu-
dent competence. For example, it is argued that if
instruction ought to be individualized, adaptive, and 
interactive, then assessment should share these 
characteristics. In general, educators who advocate 
Performance assessment covers a broad range of
testing methods that require students to create answers or
products to demonstrate what they are learning. In this art
assessment, students record their observations as they
sculpt with clay; the finished product and their notes will
become part of their portfolio for the year.

performance assessment believe testing can be made 
an integral and effective part of learning. 

One type of performance assessment uses paper-
and-pencil methods such as "constructed-response"
items, for which students produce their own answers 
rather than select from a set of choices. Other 
approaches take performance assessment further 
along the continuum—from short-answers at one 
extreme to live demonstrations of student work at 
the other (see box 1-D). Under ideal circumstances, 
these methods share the following characteristics: 

• they require students to construct responses,
rather than select from a set of answers;

• they assess behaviors of interest as directly as 
possible; 

• they are in some cases aimed at assessing group
performance rather than individual perform-
ance;

• they are criterion-referenced, meaning they 
provide a basis for evaluating a student's work 
with reference to criteria for excellence rather 
than with reference to other students' work; 

• in general, they focus on the process of problem 
solving rather than just on the end result; 

• carefully trained teachers or other qualified
judges are involved in most of the evaluation
and scoring; and



^Rudman, op. cit., footnote 8, p. 6.
Box 1-D— The Many Faces of Performance Assessment

Performance assessment is a broad term. It covers many different types of testing methods that require students
to demonstrate their competencies or knowledge by creating an answer or product. It is best understood as a
continuum of formats that range from the simplest student-constructed responses to comprehensive demonstrations
or collections of large bodies of work over time. This box describes some common forms of performance
assessment.

Constructed-response questions require students to produce an answer to a question rather than to select from
an array of possible answers (as multiple-choice items do). In constructed-response items, questions may have just
one correct answer or may be more open ended, allowing a range of responses. The form can also vary: examples
include answers supplied by filling in a blank; solving a mathematics problem; writing short answers; completing
pro^ (drawing on a figure like a graph, illustration, or diagram); or writing out all the steps in a geometry

Essays have long been used to assess a student's understanding of a subject by having the student write a
description, analysis, explanation, or summary in one or more paragraphs. Essays are used to demonstrate how well
a student can use facts in context and structure a coherent discussion. Answering essay questions effectively requires
analysis, synthesis, and critical thinking. Grading can be systematized by having subject matter specialists develop
guidelines for responses and set quality standards. Scorers can then compare each student's essays against models
that represent various levels of quality.

Writing is the most common subject tested by performance assessment methods. Although multiple-choice
tests can assess some of the components necessary for good writing (spelling, grammar, and word usage), having
students write is considered a more comprehensive method of assessing composition skills. Writing enables
students to demonstrate composition skills—inventing, revising, and clearly stating one's ideas to fit the purpose
and the audience—as well as their knowledge of language, syntax, and grammar. There has been considerable
research on the standardized and objective scoring of writing assessments.

Oral discourse was the earliest form of performance assessment. Before paper and pencil, chalk, and slate
became affordable, school children rehearsed their lessons, recited their sums, and rendered their poems and prose
aloud. At the university level, rhetoric was interdisciplinary: reading, writing, and speaking were the media of public
affairs. Today graduate students are tested at the Master's and Ph.D. levels with an oral defense of dissertations. But
oral interviews can also be used in assessments of young children, where written testing is inappropriate. An obvious
example of oral assessment is in foreign languages: fluency can only be assessed by hearing the student speak. As
video and audio make it possible to record performance, the use of oral presentations is likely to expand.

Exhibitions are designed as comprehensive demonstrations of skills or competence. They often require
students to produce a demonstration or live performance in class or before other audiences. Teachers or trained
judges score performance against standards of excellence known to all participants ahead of time. Exhibitions
require a broad range of competencies, are often interdisciplinary in focus, and require student initiative and
creativity. They can take the form of competitions between individual students or groups, or may be collaborative
projects that students work on over time.

Experiments are used to test how well a student understands scientific concepts and can carry out scientific
processes. As educators emphasize increased hands-on laboratory work in the science curriculum, they have
advocated the development of assessments to test those skills more directly than conventional paper-and-pencil
tests. A few States are developing standardized scientific tasks or experiments that all students must conduct to
demonstrate understanding and skills. Developing hypotheses, planning and carrying out experiments, writing up
findings, using the skills of measurement and estimation, and applying knowledge of scientific facts and underlying
concepts—in a word, "doing science"—are at the heart of these assessment activities.

Portfolios are usually files or folders that contain collections of a student's work. They furnish a broad portrait
of individual performance, assembled over time. As students assemble their portfolios, they must evaluate their own
work, a key feature of performance assessment. Portfolios are most common in writing and language arts—showing
drafts, revisions, and works in progress. A few States and districts use portfolios for science, mathematics, and the
arts; others are planning to use them for demonstrations of workplace readiness.
SOURCE: Office of Technology Assessment, 1992.
• students understand clearly the criteria on
which they are judged. 

Computer and Video Technologies 

Data processing technologies have played a
significant role in shaping testing as we know it
today, and could be important tools for the develop-
ment of innovative tests. Computers have most
commonly been used for the creation of test items
and the scoring and reporting of test results. New
computer and video technologies, however, used
alone or in conjunction with certain types of
performance assessment, offer possibilities for en-
hancing testing in the classroom. As computers have
become more available in schools, their use for
testing has become more feasible. Research in this
field is showing promise in the following areas:

• questions presented and answered on comput-
ers can go beyond the traditional multiple-
choice format, allowing test takers to create
answers rather than select from alternatives
presented to them;

• video, audio, and multimedia can make more
realistic and engaging questions and tasks
available;

• computer-adaptive testing can establish an
individual test taker's level of skill more
quickly and, under ideal conditions, more
accurately than conventional paper-and-pencil
testing; and

• integrated learning systems, already found in 
some classrooms, often come with testing 
embedded in the instruction and provide on- 
going analysis of student progress. 

Continued research combining computing power, 
principles of artificial intelligence, learning theory, 
and test design could yield significant advances in 
the form and content of assessment. But a set of
impressive technological and economic barriers
needs to be surmounted: for example, the limited
availability (and relatively higher cost) of hardware, 
compared to paper-and-pencil tests, has prevented 
more rapid innovation and adoption. And even with 
more hardware, there is no guarantee that the 
capacity of that hardware will be adequate to meet 
constantly increasing software requirements. An 
even greater barrier is the lack of communication 
between educators, test developers, and technolo- 



gists in achieving a consensus on the goals of testing 
and in shaping a vision for technology in the service 
of those goals. 

Using New Testing Technologies 
Inside Classrooms 

Performance assessment is not new to teachers or 
students; many techniques have long been used by 
teachers as a basis for making judgments about 
student achievement within the classroom. The form 
and complexity can vary:

• Imagine yourself a rebel at the Boston Tea
Party and write a letter describing what oc-
curred and why.

• Complete the following five geometry proofs. 

• Describe both the dramatic and situational
irony in Dickens' Hard Times, specifically
using the characters of the teacher, Mr. Mc-
Choakumchild, and the boss businessman in
Coketown, Thomas Gradgrind.

As illustrated in box 1-E, what students produce
in response to these testing tasks can reveal to the
teacher more than just what facts they have learned;
they reveal how well the student can put knowledge
in context. Well-crafted classroom performance
tasks are useful diagnostic tools that can reveal
where a student may be having problems with the
material. They can also help the teacher gauge the
pacing and level of instruction to student responses.
At their best, these tasks can be exciting learning
experiences in themselves, as when a student,
required to create a product or answer that puts
knowledge into context, is blessed with that flash of
inspiration, "Aha! I see how it all comes together
now!" In addition, these tests can signal to the
students what skills and content they should learn,
help teachers adjust instruction, and give students
clear feedback.

Much of the research about learning and cognitive
processes suggests important new possibilities for
tests that can diagnose a student's strengths and
weaknesses. Although traditional achievement tests
have focused largely on subject matter, researchers
are now recognizing that ". . . an understanding of
the learner's cognitive processes — the ways in 
which knowledge is represented, reorganized, and 
used to process new information — is also needed."^



^Robert L. Linn, "Barriers to New Test Design," The Redesign of Testing for the 21st Century, proceedings of the 1985 ETS Invitational Conference,
Eileen E. Freeman (ed.) (Princeton, NJ: Educational Testing Service, 1986), p. 73.
Box 1-E— Mr. Griffith's Class and New Technologies of Testing: Before and After

To understand how teaching and testing
of a fourth grade teacher's efforts to understand his students' progress, and the role standardized tests play in that
understanding. We start with mathematics, or, as it is known in most fourth grade classrooms, arithmetic.

Mr. Griffith is working on fractions. Among the 28 children
teacher's prompts, and usually have the answer right. Some of

it comes to adding and subtracting fractions, but appear puzzled over the rules of multiplying. The majority appear
lost when it comes to division. Griffith has a sense of these differences based on his constant interaction with his
class, but he needs more systematic information to know how to adjust his lessons.



Before 



For starters, Griffith turns to his own tests, which are tightly linked to his instructional objectives and to the
material he has covered in class. He also assesses the children in other ways: he checks their workbooks, calls on
them to do problems at the blackboard, poses questions and invites answers, and

in small groups. As an experienced teacher, Griffith can synthesize his observations of children at work into fluid
judgments of their strengths and weaknesses and go that next vital step of adjusting his pedagogy accordingly.

An additional source of information is the summary of statistics from last spring's administration of a
. . . mathematics test. From these data, Griffith could get a sense of how well the
students in his class stack up against others in the school and even to

(Did he work on his fractions over the summer?) He might

did very well on the test but still gets stuck when she has to perform at the blackboard . . . class,
of his students' learning needs or to structure his lesson plans. One problem

not even present for the spring testing, and he has no test data for them. Another problem is that the standardized
test scores do not distinguish between fractions and other applications of addition and subtraction. When Griffith
. . . fractions, there is no guarantee that the next topic on the curriculum will have been covered on the
standardized test.

It is not much better with reading and writing. The
tests . . . the district still give passages out of context that have no meaning for many of the students. And

even though Griffith feels it is important to have his students do as much writing as possible,
questions, spelling and vocabulary

Important as they are, they do not inspire much enthusiasm

After 

Consider again the situation of Mr. Griffith, our fourth grade teacher. In the last few years, his school has
. . . technology. Each class now has several computers linked together in
. . . curriculum taught in his school. Money from
. . . and a VCR, which connect

to a television that had been locked in the storage room until a few years back. Occasionally

teaching with computers and has grown pretty comfortable with their use, especially since he knows that his
colleague, Mrs. Juster, a computer whiz, is just across the hall and willing to help him when he gets stuck.

Mr. Griffith finds that, as he uses these technologies for teaching, common sense requires that he use them for
testing as well. Like the teaching, the testing varies. Some of the testing he does is the same as before, but made
. . . the help of a software package, he can design his own short-answer,

essay, or multiple-choice quizzes geared to the material he has been teaching. He appreciates the fact that the


software can automatically translate questions into Spanish, so Maria and Esteban, who recently arrived from El

Salvador, can take tests with the rest of the class. The children say these tests are much easier to read than the

handwritten ones he had to crank out on the school's ancient mimeograph machine. He

records with "gradebook" software that automatically computes and

who is slipping in time for him to set up his little "fireside chats" with students.

But the real change has been in being able to link his testing closely
having his students do a lot of writing on the word processor. Now he has the students pass their writing around
on the computer, make comments on each other's works, and save their first drafts. They seem more comfortable
making revisions, and he can grade final products that are
written work in electronic portfolios on disk; at the end

out for inclusion in the portfolio they take with them to the fifth grade. Some, like Regine, have a hard time deciding
what is best and why. She'd like to print it all!

The mathematics they have been working on is included in the software in the ILS: same old fractions and long
division—the material that Griffith has watched, over the years, turn some students off mathematics forever while
others just breeze through it. But at least now he can get a better handle on

Dana is no problem—he has already moved on to two- and three-digit long division. At the end of his work, the
system prints out a report that shows he got all 10 problems in the

Griffith makes a note to himself—"Move Dana ahead to the next unit on the program and see how he does. It's
far better than having him staring out the window while I'm going over the basics with the other kids." Michelle,
who did fine with multiplication, continues to have difficulty in division problems. A quick printout of the problems
she missed—with the step-by-step procedure she followed—reveals that her problem lies in subtraction—she keeps
forgetting principles of carrying. "Maybe I can get Brad to work with her on some of those problems," he thinks.
"Oops, Brad is too much of a tease. Better ask Kevin instead."

Before it is time for the first grading period, Griffith prints a summary
is still a huge range in their skills, especially in mathematics.
programs, the curriculum can still be pretty deadly, Griffith knows.
Mrs. Juster told him about as ways to get his students more interested
one about the abandoned bell tower at the edge of town, in which the bell starts
interest," he thinks. They like working in groups and digging out the

doing the mathematics to solve the problem might put some of these dry mathematics facts into context. Maybe.

While they are watching the video, Griffith plans to get Elise, a student
from a neighboring school district, started on the computer-adaptive
she is quite far behind the other students; this will give

whether she might benefit from the Chapter 1 program in the school. "Shoot, I hate to have her miss that video,
though. I suppose I can see if she can stay after school and take the test. She'll
be late picking up the baby at the day care center. And then there's the

and Sherri . . . They are working on a report on 'Why we need new playground equipment' and interviewing
students playing in the schoolyard after school. I can see they'll need a lot of help with that! Whoever said
technology makes teaching easier?"

SOURCE: Fictional scenario prepared 1992.



New diagnostic tests, informed by cognitive science 
research, may help teachers recognize more quickly 
the individual learner's difficulties and intervene to 
get the learner back on track. Similarly, computer-
administered tests open up new possibilities for



keeping records of a student's errors or ineffective 
problem-solving strategies, and for providing imme-
diate feedback so that children can recognize their
errors while still involved in thinking about the 
questions.^ 



^See, for example, Isaac Bejar, "Educational Diagnostic Assessment," Journal of Educational Measurement, vol. 21, No. 2, summer 1984, pp.
.189.
Using New Testing Technologies
Beyond Classrooms 

Teaching has always been an art more than a
science, and what works in one classroom with one 
teacher does not easily transfer to other classrooms 
with other teachers.^^ Consequently, many of the 
methods used by teachers to gauge the progress of 
their students and adjust their lessons are not
standardized. As long as teachers can correct their 
judgments on a continuing and fluid basis, day by 
day and hour by hour, teacher experimentation with 
a wide range of inferential assessment methods 
presents no particular harm and can offer many 
benefits. 

When judgments about student performance are 
moved outside the classroom, however, they must be 
comparable: ". . . whatever contextual understand-
ing of their fallibility may have existed in the
classroom is gone."^^ Using tests fairly and appro-
priately for management decisions about schools or
students, therefore, imposes special constraints. As 
explained in detail in chapter 6, standardization in 
test administration and scoring is the first necessary 
condition to make test results comparable. It is 
precisely the recognition that individual teachers'
judgments may be insufficient as the basis for
crucial decisions affecting children's futures that
historically has fueled public interest in standardized 
tests originating from outside the classroom or 
school.^^ 

It is important to recall that the basic concept of 
direct assessments of student performance is not 
new. American schools traditionally used oral and 
written examinations to monitor performance. It was 
the pressure to standardize those efforts, coupled
with the perceived need to test large numbers of 
children, that led eventually to the invention of the 
multiple-choice format as a proxy for genuine 
performance. Evidence that these proxies were more 
efficient in informing administrative decisions rap- 
idly boosted their popularity, despite their less 



obvious relevance to classroom learning. The mod-
ern performance assessment movement is based on
the proposition that new testing technologies can be 
more direct, open ended, and educationally relevant 
than conventional tests, and also reliable, valid, and 
efficient. 

How can performance assessments and computer- 
based tests contribute to system monitoring and 
selection, placement, and credentialing decisions? A 
growing number of States are experimenting with 
answers to this question. Thirty-six States currently 
use writing assessments and nine others are planning 
to introduce writing assessment in the near future. 
Twenty-one States currently use other performance
assessment methods including portfolios, constructed
response, and hands-on demonstrations; 19 States 
plan to adopt some or all of these methods. Figure 
1-3 shows the current geographic distribution of 
States using writing and other performance assess- 
ments. Some States are using sampling technologies 
to reduce the direct costs of performance assess- 
ments and are seeking to resolve various technical 
problems. Most States are using these tests in 
combination with the more familiar multiple-choice 
test. 

To the extent that decisions about school re-
sources could be based on these statewide assess-
ments, they are potentially high stakes. Advocates 
maintain that performance assessments have a clear 
advantage over standardized multiple-choice tests, 
because they assess a wider range of tasks. Al- 
though these assessments do not necessarily 
provide different estimates of individual student 
progress than some conventional tests, many 
educators believe their advantage lies in their 
more obvious relevance to learning goals. The 
involvement of teachers in developing and scoring 
performance assessments is crucial to keeping them 
closely linked to curricula and instruction.

Using performance assessments beyond the con-
fines of classrooms raises a set of important research
and policy issues:
^See Richard Murnane and Richard Nelson, "Production and Innovation When Techniques are Tacit: The Case of Education," Journal of Economic
Behavior and Organization, vol. 5, 1984, pp. 353-373; also Pauly, op. cit., footnote 1^^

^Stephen Dunbar, Daniel Koretz, and H.D. Hoover, "Quality Control in the Development and Use of Performance Assessments," paper presented
at the annual meeting of the National Council on Measurement in Education, Chicago, IL, April 1991, p. 1.

^If decisions about children's future opportunities are at stake, then the tests must also demonstrate "predictive validity," i.e., they must
provide reasonably accurate information about individual potential for future behavior in school, work, or elsewhere. For discussion of issues pertaining
to the use of test scores in predicting future performance, see, e.g., Henry Levin, "Ability Tests for Job Selection: Are the Economic Claims Justified?"
. . . of Opportunity, B. Gifford (ed.) (Boston, MA: Kluwer, 1990); and James Crouse and Dale Trusheim, The Case Against the
SAT (Chicago, IL: University of Chicago Press, 1988).
Figure 1-3--Statewide Performance Assessments, 1991

Writing assessment only (n=15)
Writing and other types of performance assessments (n=21)
None (n=14)

NOTE: Chart includes optional programs.
SOURCE: Office of Technology Assessment, 1992.

• The most common form of performance assess-
ment is the evaluation of written work: essays,
compositions, and creative writing have been
widely used in large-scale testing programs.
Other forms of performance assessment are still
in earlier stages of development and, though
promising, require considerable experimenta-
tion before they can be used for high-stakes
decisions.

• If performance assessment is to be successfully 
adopted, continuing professional development 
for teachers will be critical. Most teachers 
receive little formal education in assessment. 
Performance assessment may provide a great 
opportunity for teacher development that links 
instruction with assessment. 

• Some parents and educators are worried that a 
move to greater use of performance assessment 
could have a negative impact on minority 
groups. It is critical that the issues of cultural 
influence and bias be scrutinized in all aspects 



of performance assessment: selection of tasks,
administration, and scoring.

• Administration and scoring of performance 
assessment are both time consuming and labor 
intensive. If the time spent on testing is viewed 
as integral to instruction, however, new meth-
ods could be cost-effective.

Computer technologies, too, may play a powerful 
role in system monitoring and high-stakes testing of 
individual students. In particular: 

• Adaptive testing, in which the computer selects
questions based on individual students'
responses to prior questions, can provide more
accurate data than conventional tests, and in
less time.^33

• Advances in software could make possible
automated scoring that closely resembles human
scoring.

• Large item banks made possible by advanced
storage technologies could lower the costs of
test development by allowing State or district
testing authorities to tap into common pools of
questions or tasks.

• With the combination of large item banks,
computer-adaptive software, and computerized
test administration, tests would no longer need
to be composed in advance and printed on
paper; rather, each student sitting at a terminal
could theoretically face a completely
individualized test. This could reduce the need for tight
test security, given that most students cannot
memorize the many thousands of items stored
in item banks.

^33 For discussion of the state of the art in computer-adaptive testing, see Bert F. Green, The Johns Hopkins University, "Computer-Based Adaptive
Testing in 1991," monograph, May 9, 1991.
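
The idea behind adaptive testing, as described in the list above, can be illustrated with a minimal sketch. It is not the method used by any operational testing program: real computer-adaptive tests typically rely on item response theory, whereas this example assumes a simple "staircase" rule in which the ability estimate moves up after a correct answer and down after an incorrect one, by a step that halves each time. The function name, the item bank layout, and the difficulty scale are all hypothetical.

```python
def run_adaptive_test(item_bank, answer_fn, num_items=5, start=0.0, step=1.0):
    """Administer items one at a time, choosing each from the current estimate.

    item_bank: dict mapping item id -> difficulty (hypothetical scale).
    answer_fn: callable(item_id, difficulty) -> True if answered correctly.
    Returns the final ability estimate and the list of (item_id, correct).
    """
    remaining = dict(item_bank)          # copy so the caller's bank is untouched
    estimate = start
    administered = []
    for _ in range(min(num_items, len(remaining))):
        # Select the unused item whose difficulty is closest to the estimate.
        item_id = min(remaining, key=lambda i: abs(remaining[i] - estimate))
        difficulty = remaining.pop(item_id)
        correct = answer_fn(item_id, difficulty)
        # Staircase update (an assumption of this sketch, not an IRT estimate).
        estimate += step if correct else -step
        step /= 2
        administered.append((item_id, correct))
    return estimate, administered
```

Because each student's sequence of items depends on that student's own responses, two students drawing on the same large item bank will generally see different tests, which is the property the text links to reduced security concerns.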

An important policy question regarding comput- 
ers in testing is whether to invest in new technolo- 
gies for scanning hand-written responses to open- 
ended test items. Since more tests may one day be 
administered by computer, investing in new
scanning technologies could be wasteful.

Special Considerations for System Monitoring 

Performance assessments and computer-based 
tests could be designed to provide information on the 
effectiveness of schools and school systems. As with 
all tests, though, the outcomes of these new tests 
need to be interpreted judiciously: the relative
performance of schools or school systems must be
viewed in the context of many factors that can
influence achievement.

Because individual student scores are not
necessary for system monitoring, innovative sampling
methods can be used that offer many important
advantages for implementing performance
assessments. When sampling is used, inferences can be
made about a school system either by testing
a representative subsample of students or by giving
each student only a sample of all the testing tasks.
These methods can lessen considerably the direct 
costs of using long and labor-intensive performance 
tasks, allow broader coverage of the content areas 
that appear on the test, and still keep testing time 
limited. Furthermore, sampling methods provide 
important protection against misuse of a test for 
other functions (such as selection, placement, or 
certification), since students do not receive individ- 
ual scores. 
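
The second sampling approach described above, in which each student receives only a sample of the tasks, can be sketched as follows. This is an illustrative sketch only: the function names, the rotating block assignment, and the score lookup are assumptions made for the example and do not describe the procedure of any particular testing program.

```python
import random
from collections import defaultdict

def assign_task_blocks(student_ids, tasks, tasks_per_student, seed=0):
    """Give each student a small block of tasks, rotating through the pool."""
    if not 1 <= tasks_per_student <= len(tasks):
        raise ValueError("tasks_per_student must be between 1 and len(tasks)")
    rng = random.Random(seed)
    order = list(student_ids)
    rng.shuffle(order)                   # randomize who gets which block
    assignment = {}
    for position, student in enumerate(order):
        start = (position * tasks_per_student) % len(tasks)
        assignment[student] = [
            tasks[(start + offset) % len(tasks)]
            for offset in range(tasks_per_student)
        ]
    return assignment

def system_level_means(assignment, score_lookup):
    """Average score per task across the students who took it.

    Only aggregate results are produced; no per-student total is computed.
    """
    scores_by_task = defaultdict(list)
    for student, block in assignment.items():
        for task in block:
            scores_by_task[task].append(score_lookup(student, task))
    return {task: sum(v) / len(v) for task, v in scores_by_task.items()}
```

With 30 students, six tasks, and two tasks per student, every task is still attempted by ten students, so the system-level picture covers all six tasks while each student spends time on only two.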

However, the use of sampling methods raises
specific concerns: one issue is whether students' less
obvious incentives to do well on such tests (given
that no individual consequences are attached to
performance) could lead to erroneously low
estimates of aggregate achievement. A related issue is
whether tests administered to samples of students
will effectively signal to all students what they are
expected to learn. A third question is whether it
would be fair to administer new testing methods,
intended as tools for enriched instruction, to samples
of students rather than to all students.

Photo credit: IBM Corp.

Computers can change testing just as they change
learning. Recent advances in computers, video, and
related technologies could one day revolutionize testing.

These issues warrant further research as a prereq- 
uisite to using new testing methods for system 
monitoring functions. 

Special Considerations for Selection,
Placement, and Credentialing

New testing technologies have considerable po- 
tential to enrich selection and certification decisions. 
For example, portfolios of student work can provide 
richly detailed information about progress and 
achievement over time that seems particularly
relevant and useful for certification decisions. One
example is the Advanced Placement (AP) studio art
examination, administered by the Educational
Testing Service (ETS), which is based on a portfolio of
student artwork. This examination is used to award
college credit, and, as such, certifies that a student
has mastered the skills expected of a first-year
college student in studio art.

Tests based on complex computer simulations of
"on-the-job" settings are being developed for
architecture, medicine, and other professions, as a
basis for professional licensing and certification; the
integration of graphics, video, and simulation
techniques can create tests more closely resembling the
actual tasks demanded by those professions.
Although promising, these initial efforts have
uncovered some technical issues that will require
considerably more research before the tests can accurately
and fairly assess the skills of interest, and be used to
make high-stakes decisions about individuals.^34

OTA has identified the following central policy 
issues concerning the design of new tests for 
selection, placement, and certification. 

Technical requirements: These tests must meet
very high technical standards. Inferences drawn
from them must be based on rigorous standards of
empirical evidence not necessarily required of tests
used for other functions. Because tests used to select,
place, or certify individuals can have potentially
long-term and significant consequences, their uses
need to be limited to the specific functions for which
they are designed and validated. Similarly, because
test scores are only estimates, very high levels of
reliability, or consistency, must be demonstrated for
the test as a whole. Finally, because of the amount of
day-to-day variability in individuals, no one test
score should be used alone to make important
decisions about individuals.^35



Generalizability: Another issue pertains to the
content coverage of new assessment formats, such as
exhibitions, portfolios, science experiments, or
computer simulations. The advantage of these formats is
in their coverage of relevant factors of performance
and achievement; however, this usually means that
only a few such long and complex tasks can be
completed by a single child in the allotted time.^36
Are inferences about achievement made on the basis
of just a few tasks generalizable across the whole
domain of achievement? When each child can
complete only a few tasks, there is a much higher
risk that a child's score will be specific to that
particular task. Selection and certification decisions
cannot be made on the basis of these tasks unless
results are stable and generalizable.

Security: Currently most high-stakes selection,
placement, or certification tests are multiple-choice,
and precautions are taken to keep items secret.
Scores would be suspect if some (or all) test takers
knew the items in advance.^37 Given the relatively
low number of performance-based tasks that might
appear on some new tests, sharing of information
from one cohort of test takers to another could
become a problem undermining the test's validity.
Computers with enough memory to accommodate
very large item banks may provide some
technological relief, although the question remains open as to
whether a sufficient number of different items could
be written at reasonable cost.

Fairness: Most previous legal challenges have
targeted tests used to make significant decisions
about individuals. Any test designed for selection,
placement, or certification will be carefully
scrutinized by those concerned with equity and bias.
Designing a performance-based selection or
certification test will require considerable research to
ensure elimination of bias.



^34 See, for example, David B. Swanson, John J. Norcini, and Louis J. Grosso, "Assessment of Clinical Competence: Written and Computer-Based
Simulations," Assessment and Evaluation in Higher Education, vol. 12, No. 3, 1987, pp. 220-246.

^35 An additional reason for insisting on high standards is that high-stakes tests can lead inadvertently to the labeling of individuals, by themselves
or by others, with uncertain and potentially harmful consequences. For discussion of these issues see, e.g., U.S. Congress, Office of Technology
Assessment, "The Use of Integrity Tests for Pre-Employment Screening," background paper of the Science, Education, and Transportation Program,
September 1990.

^36 Increasing the time allotted to assessment does not necessarily imply reduced time for instruction, as long as the two activities are well integrated.
But completely "seamless" integration of testing and instruction could raise problems of its own, such as potential infringement of students' rights to
know whether they are being tested and for what purposes.

^37 The concept of "test openness" is controversial. Most traditional measurement experts argue that allowing students access to test items in advance
would irreparably compromise the test's validity. For opposing viewpoints, however, see, e.g., Judah Schwartz and Katherine A. Viator (eds.), The Price
of Secrecy: The Social, Intellectual and Psychological Costs of Current Assessment Practice, A Report to the Ford Foundation (Cambridge, MA:
Harvard Graduate School of Education, September 1990); and John Frederiksen and Alan Collins, "A Systems Approach to Educational Testing,"
Educational Researcher, vol. 18, No. 9, December 1989, pp. 27-32.
Cost Considerations: A Framework
for Analysis

A common challenge posed to advocates of
alternative assessment methods is an economic one:
can they be administered and scored as efficiently as
conventional standardized tests?^38 Indeed, one of the
attractive features of commercially published
standardized tests is their apparently low cost. As shown
in box 1-F, OTA estimated outlays for standardized
testing in a large urban school district were
approximately $1.6 million for 1990-91 ($0.8 million per
test administration), or only about $6 per student per
test administration.

But these outlays on contracted materials and 
services and district testing personnel do not tell the 
whole story. First, they neglect the dollar value of 
teacher time devoted to test administration. Because 
a teacher's many activities are not typically itemized 
on a school district budget, the costs associated with 
teacher time spent administering tests are less 
obvious than other testing expenses. But they can be 
significant: in the district studied by OTA, the 
portion of total teacher salaries attributable to time 
spent administering tests was roughly $1.8 million 
per test, or $13 per pupil. 

Another important component of cost is the time
spent by teachers in test preparation. This factor is
more variable than administration time and is more
difficult to estimate. It depends largely on the degree
to which teachers can distinguish their regular
instruction from classroom work that is driven by the
need to prepare students for specific tests. The
question is whether the test preparation activities
would take place even in the absence of testing: this
issue hinges partly on test content (how closely
does the test reflect curricular and instructional
objectives?) and partly on how individual teachers
allocate their classroom time across various
activities, including test-related instruction. (Tests that are
intended to be linked to instruction might not be
perceived as such by some teachers, and tests that are
apparently separate from regular instruction could
be useful tools in the hands of other teachers.) In the
district OTA studied, teachers reported spending
anywhere from 0 to 3 weeks in preparing their
students for each test administration, at a cost as
high as $13.5 million per test, or close to $100 per
pupil.^39

Just as counting material and testing personnel
outlays alone can lead to deceptively low estimates
of the total resources devoted to testing, accounting
fully for teacher administration and preparation time
can lead to deceptively high cost estimates. To
correctly account for teacher time requires attention
to the indirect or opportunity costs of that time. An
opportunity cost is defined generally as "... the
value of foregone alternative action."^40 With respect
to testing, analysis of opportunity costs focuses
attention on the following question: to what extent
does the time spent by teachers on preparation and
administration of tests contribute to the core
classroom activities of teaching and learning?

If testing is considered integral to instruction, then
teacher time spent on preparing students and on
administering the tests has lower opportunity costs
than if the testing has little or no instructional value.
To estimate the opportunity costs, then, requires 
information or assumptions about the degree to 
which any particular test is intended as an instruc- 
tional tool, and information or assumptions about 
the extent to which individual teachers use testing as 
part of their instructional program. 

As shown in box 1-F, some teachers in the district
OTA studied spent as much as 3 weeks preparing
students for each of the two standardized tests, plus
4 days administering each test. The worst case would
be one in which this time was completely irrelevant
to coursework: the district would have incurred
steep opportunity costs, about $15 million per test,
or close to $110 per pupil. The best case, in which all
preparation time was relevant to coursework, would
have cost under $2 million per test, or $13 per pupil.
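
The best-case and worst-case figures above can be reproduced with a short calculation. The sketch below uses only the dollar amounts reported in the text; the function name and the assumption that cost scales linearly with the share of preparation time that is irrelevant to coursework are illustrative, not part of OTA's analysis. Administration time is counted in full in both cases, which is what the text's best-case figure of $13 per pupil implies.

```python
# Figures reported in the text for the district OTA studied,
# per test administration.
PREP_COST_MILLIONS = 13.5     # upper-bound cost of teacher preparation time
ADMIN_COST_MILLIONS = 1.8     # cost of teacher time spent administering
PREP_COST_PER_PUPIL = 100     # "close to $100 per pupil"
ADMIN_COST_PER_PUPIL = 13

def opportunity_cost(prep_irrelevant_share, prep_cost, admin_cost):
    """Opportunity cost of teacher time for one test administration.

    prep_irrelevant_share: fraction (0 to 1) of preparation time that does
    not contribute to regular coursework. Linear scaling is an assumption
    of this sketch.
    """
    if not 0.0 <= prep_irrelevant_share <= 1.0:
        raise ValueError("prep_irrelevant_share must be between 0 and 1")
    return prep_irrelevant_share * prep_cost + admin_cost
```

Setting the share to 1 gives $15.3 million and $113 per pupil, matching the text's "about $15 million" and "close to $110"; setting it to 0 gives $1.8 million and $13 per pupil. Intermediate shares show how sensitive the estimate is to judgments about the instructional value of preparation time.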

Thus, the total costs of a testing program consist
of both direct and opportunity components: direct
expenditures on materials, services, and salaries, and
indirect costs of time spent on testing activities.^41
For a graphical exposition of this concept, see box
1-G.
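
The decomposition of total costs into direct and opportunity components can also be written as a small model for comparing two testing programs. This is a sketch under stated assumptions: it treats opportunity cost as rising in proportion to time spent on testing, and every name and parameter here is hypothetical rather than taken from OTA's analysis.

```python
def total_cost(direct_outlay, opportunity_rate, hours):
    """Total cost = direct outlay + opportunity cost of time spent on testing.

    opportunity_rate is the assumed cost per hour of testing time that does
    not contribute to instruction.
    """
    return direct_outlay + opportunity_rate * hours

def break_even_hours(option_1, option_2):
    """Hours of testing time at which two programs cost the same.

    Each option is a (direct_outlay, opportunity_rate) pair. Returns None
    when the cost lines never cross at a non-negative number of hours.
    """
    (direct_1, rate_1), (direct_2, rate_2) = option_1, option_2
    if rate_1 == rate_2:
        return None
    hours = (direct_2 - direct_1) / (rate_1 - rate_2)
    return hours if hours >= 0 else None
```

For instance, with purely hypothetical values, a test with a direct outlay of 2 and an opportunity rate of 1.5 per hour costs the same as a test with a direct outlay of 10 and a rate of 0.5 per hour once 8 hours are devoted to testing; beyond that point the test with higher direct costs but greater instructional value is the cheaper of the two.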



Federal Policy Concerns 

Several proposals now pending before Congress
could fundamentally alter testing in the United
States. Three issues already on Congress' agenda are
proposals for national testing, changes to the
National Assessment of Educational Progress (NAEP),
and revisions to the program that assists
educationally disadvantaged children (Chapter 1). Federal
action could also focus on ensuring the appropriate
use of tests, and speeding research and development
on testing.

These policy opportunities combined with the
current national desire to improve schooling provide
Congress with an opportunity to form
comprehensive, coordinated, and far-reaching test policy.
Rather than allowing test activity to occur
haphazardly in response to other objectives,
decisionmakers can bring these several concerns together in
support of better learning.



National Testing

As discussed in chapter 3, the past year has 
witnessed a flurry of proposals to establish a system 
of national tests in elementary and secondary 
schools. Momentum for these efforts has built 
rapidly, fueled by numerous governmental and 
commission reports on the state of the economy and 
the educational system; by the National Goals 
initiative of the President and Governors; by casual
references to the superiority of examination systems
in other countries (see box 1-H); and most recently
by the President's "America 2000" plan.

The use of tests as a tool of education policy is 
fraught with uncertainties. The first responsibility of 
Congress is to clarify exactly what objectives are 
attached to the various proposals for national
testing, and how instruments will be designed,
piloted, and implemented to meet these objectives.
The following questions warrant careful attention:

• If tests are to be somehow associated with
national standards of achievement, who will
participate in setting these standards? Will the
content and grading standards be visible or
invisible? Will the examination questions be
kept secret or will they be disclosed after the
test?

^41 In addition to teacher time, there are opportunity costs associated with student time: assuming that instructional time is an investment with economic
returns, student time spent on testing can be valued in terms of foregone future income. This follows a "human capital" investment model of education.
See, e.g., Gary Becker, Human Capital, 2d ed. (New York, NY: National Bureau of Economic Research, 1975). For application of the concept of indirect
costs to educational testing see also Walter Haney, George Madaus, and Robert Lyons, Boston College, "The Fractured Marketplace for Standardized
Testing," unpublished manuscript, September 1989.

Box 1-G. Direct and Opportunity Costs of Testing

[Figure: total costs (direct + opportunity costs) plotted against time
spent on testing (preparation and administration) for testing option 1
and testing option 2; the two lines cross at point A.]

This figure illustrates the relationship between
time spent on testing activity and the total costs of
testing. Hypothetical test 1 is assumed to contribute
little to classroom learning. It costs little in direct
dollar outlays, but is dear in opportunity costs. Total
costs begin relatively low but rise rapidly with time
devoted by teachers and students to activities that
take them away from instruction.

Hypothetical test 2, which is a useful instruction
and learning tool, requires relatively high direct
expenditures. But the opportunity costs of time
devoted to testing are relatively low.

At point A, a school district would be indifferent
between the two testing programs, if cost was the
main consideration.

SOURCE: Office of Technology Assessment, 1992.

• If the objective of the test is motivational, i.e.,
to induce students and teachers to work harder,
then the test is likely to be high stakes. What
will happen to students who score low? What
resources will be provided for students who do
not test well? What inferences will be made
about students, teachers, and schools on the
basis of test results? What additional factors
will be considered in explaining test score
differences? Finally, will the tests focus the
attention of students and teachers on broad
domains of knowledge, as desired, or on
narrower subsets of knowledge covered by the
tests, as often happens?

• If the Nation is interested in using tests to
improve the qualifications of the American
work force, how will valuable nonacademic
skills be assessed? What should be the balance
of emphasis between basic skill mastery and
higher order thinking skills?

• If there is impatience to produce a test quickly,
it is likely to result in a paper-and-pencil
machine-scorable test. What signal will this
give to schools concerning the need to teach all
students broader communication and
problem-solving skills?

• What effects will national tests have on current
State and local efforts to develop alternative
assessment methods and to align their tests
more closely with local educational goals?

• Would the national examinations be
administered at a single setting or whenever students
feel they are ready?

• Would students have a chance to retake an
examination to do better?

• Would the tests be administered to samples of
students or all students?

• At what ages would students be tested?

• What legal challenges might be raised?

If a test or examination system is placed into 
service at the national level before these impor- 
tant questions are answered, it could easily 
become a barrier to many of the educational 
reforms that have been set into motion, and could 
become the next object of concern and frustration 
within the American school system. 

Given that a national testing program could be 
undertaken through State and/or private sector 
initiatives, the role of Congress is not yet entirely 
clear. However, to the extent that congressional 
action regarding NAEP, Chapter 1, and appropriate
test use will affect the need for and impact of any
national examinations, Congress has a strong
interest in clarifying the purposes and anticipated
consequences of such examinations. Also, Congress must
carefully analyze the pressures the national test
movement is exerting on these programs, such as the 
idea of converting NAEP into a national test for all 
students. 

Future of the National Assessment of 
Educational Progress 

NAEP has proven to be a valuable tool to track
and understand educational progress in the United
States. It was created in 1969 and is the only
regularly conducted national survey of educational
achievement at the elementary, middle, and high
school levels. It was designed to be an educational
indicator, a barometer of the Nation's elementary
and secondary educational condition. NAEP reports
group data only, not individual scores.

Box 1-H. National Testing: Lessons From Overseas^1

The American educational system has a traditional commitment to pluralism in the definition and control of
curricula as well as the fair provision of educational op[...]
examination systems, which have historically been geared principally toward selection, placement, and
credentialing, need to be considered judiciously. OTA finds that the following factors should be considered when
comparing examination systems overseas with those in the United States:

• Examination systems in almost every industrialized country are in [...]
have been quite radical in several countries. Nevertheless, there is still a relatively greater emphasis on tests
used for selection, placement, and certification than in the United States.

• None of the countries studied by OTA has a single, centrally prescribed examination that is used for all
purposes: classroom diagnosis, selection, and school accountability. Most examinations overseas are used
today for certifying and sorting individual students, not for school or system accountability. Accountability
in European countries is typically handled by a system of inspectors charged with oversee[...]
examination quality. Some countries occasionally test samples of students to gauge nationwide
achievement.

• External examinations before age 16 have all but disappeared from the countries in the European
Community. Primary certificates used to select students for secondary schools have been dropped as
comprehensive education past the primary level has become available to all students.

• The United States is unique in the extensive use of standardized tests for yo[...]
for testing all American elementary school children with a commonly administered and graded examination
would make the United States the only industrialized country to adopt this [...]

• There is great variation in the degree of central control over curriculum [...]
some countries centrally prescribed curricula are used as a basis for required examinations (e.g., France,
Italy, the Netherlands, Portugal, Sweden, Israel, Japan, China and, most recently, the United Kingdom).
Other countries are more like the United States in the autonomy of States, provinces, or districts in setting
curriculum and testing requirements (Australia, Canada, Germany, India, and Switzerland).

• Whether centrally developed or not, the examinations taken during and at the [...]
other countries are not the same for all students. Syllabi in European countries determine subject-matter

^1 This draws on information from George Madaus, Boston College, and Thomas Kellaghan, St. Patrick's College, Dublin, "Student
Examination Systems in the European Community: Lessons for the United States," OTA contractor report, June 1991.

(Continued)

NAEP has also been an exemplary model of 
careful and innovative test design. As discussed in 
chapter 3, NAEP has made pioneering contributions 
to test development and practice: "matrix"
sampling methods, broad-based processes for building
consensus about educational goals, an emphasis on 
content-referenced testing, and the use of various 
types of open-ended items in large-scale testing. 

If Congress wishes to develop a new national
test, to be administered to each child and used as
a basis for important decisions about children
and schools, OTA concludes that NAEP is not
appropriate. This objective would require
fundamental redesign and validation of NAEP, and would
alter the character and value of NAEP as the
Nation's independent gauge of educational progress.
It would also greatly increase both the cost and time
devoted to NAEP at every level.

A better course for Congress is to retain and
strengthen NAEP's role as a national indicator of
educational progress. To do this, Congress could:

• require NAEP to include more innovative items
and tasks that go beyond multiple choice;

• fund the development of a clearinghouse for the 
sharing of NAEP data, results of field trials, 
statistical results, and testing techniques, giv- 
ing States and local districts involved in the 
design of new tests better access to the lessons 
from NAEP; 

• restore funding for NAEP testing in more 
subject areas, such as the fine arts; 
Box 1-H. National Testing: Lessons From Overseas (Continued)

[...] countries. Although issues of fairness and com[...]
the United States.

• Teachers in other countries have considerable responsibility for admini[...]
some countries (Germany, the U.S.S.R., and Sweden) they even grade their own students. Teacher contracts
often include the expectation that they will develop or score examinatio[...]
summer pay to [...] examinations.

• Syllabi, topics, and even sample questions are widely publicized in advance of examinations, and it is not
considered wrong to prepare explicitly for ex[...]
influences instruction and learning.

• In European countries, the dominant form of examination is "essay on demand." These examinations
require students to write essays of varying lengths in response to short-answer or open-ended questions.
Use of multiple-choice examinations is limited, except in Japan, where th[...]
States. Oral examinations are still common in some of the German Länder and in foreign language testing
in many countries. Performance assessments of other kinds (demonstrations and portfolios) are used for
internal classroom assessment.



• support the continued development of methods 
to communicate NAEP results to school offi- 
cials and the general public in accurate and 
innovative ways (particular emphasis could be 
placed on informing the public about appropri- 
ate ways to interpret and understand such test 
data and on minimizing misinterpretation by 
the press and general public);

• add testing of nonacademic skills and knowl- 
edge relevant to the world of work; 

• restore funding for the assessment of out-of- 
school youth at ages 13 and 17, to provide a 
better picture of the knowledge and skills of an 
entire age cohort; 



• request data on the issues surrounding test- 
takers' motivation to do well on NAEP in 
various grades;^42

• expand NAEP to assess knowledge in the adult 
nonschool population; and 

• ensure that matrix sampling is retained, to 
minimize both costs and time requirements of 
NAEP. 

An experiment in extending the uses of NAEP to 
provide data on educational progress at the State 
level and to measure this progress against national 
standards is now under way. 

OTA has identified three potential problems of
using NAEP for State-by-State comparisons that
Congress should review before making a final
decision on a permanent use of NAEP for this
purpose. First, States could be pressured to introduce
curriculum changes to improve their NAEP
performance on certain subjects, regardless of whether
such changes have educational merit. For example,
following the release in 1991 of the State-by-State
results from the first such trial, some States (e.g., the
District of Columbia) announced plans to revamp
their mathematics curricula. It could be argued that
the use of NAEP as a prod to State education
authorities to rethink their curricula is a good thing;
however, it is clear that the pressure to perform on
the test can outweigh the stimulus for careful
deliberation about academic policy, and that many
States could make changes for the sake of higher
scores rather than improved learning opportunities
for children. This signifies putting the cart of testing
before the horse of curriculum, exactly the kind of
outcome feared by the original designers of NAEP
who insisted that scores not be reported below broad
regional levels of aggregation.

^42 In particular, questions have been raised about the accuracy of information derived from tests of 12th graders who are about to graduate. Further
efforts and research could shed light on this issue. Ed Roeber, Michigan Educational Assessment Program, personal communication, October 1991.

Photo credit: National Assessment of Educational Progress

The National Assessment of Educational Progress (NAEP)
has pioneered the use of performance assessments in
large-scale testing programs. In this science task, 7th and
11th grade students figure out which of the three materials
would make the box weigh the most.

Second, the presentation of comparative scores
could lead to intensified school-bashing, even when
differences in average State performance are
statistically insignificant or when those differences reflect
variables far beyond the control of school
authorities. Critics of comparative NAEP reporting point
out that low-scoring States need real help (financial,
organizational, and educational), not just more
testing and public humiliation.

Finally, extending NAEP to State-level analysis
and reporting is a costly undertaking. NAEP funding
jumped from $9 million in 1989 to $19 million in
1991. It is not clear that this extra money provides a
proportional amount of useful information: one
researcher interested in this question showed that
roughly 90 percent of the variance in average State
performance on NAEP could be explained by
socioeconomic and demographic variables already
available from other data.^43 In a time of scarce
educational resources, NAEP extensions need to be
weighed carefully on the scale of anticipated
benefits per dollar. State-by-State comparisons of NAEP
performance may not pass this cost-benefit test.^44

These issues notwithstanding, many education 
policymakers at the State and national levels have 
insisted that State-level NAEP could provide new 
and useful information to support curricular and 
instructional reform. Their arguments should be 
taken as potentially fruitful research hypotheses and 
treated as such: just as new medical treatments 
undergo careful experimentation and evaluation 
before gaining approval for general public use, 
extensions and revisions to NAEP should be post- 
poned pending analysis of research data. 

In education, the line between research and 
implementation is often blurred; few newspapers 
i ^ted that the 1990 State mathematics results were 
the first in a **trial'* program — the results were 



^^See Richard Wolf, Ifeschert College. Columbia University. '*What Can We Learn From State NAEP?'* uiq^ublisbed document, ad. 

^44 See also Daniel Koretz, "State Comparisons Using NAEP: Large Costs, Disappointing Benefits," Educational Researcher, vol. 20, No. 3, April
1991, pp. 19-21.
treated as factual evidence of relative effectiveness
of State education systems. 

The NAEP standard-setting process also raises 
questions of feasibility and desirability. As discussed
in chapter 6, the translation of broad educational
goals (such as emphasizing problem-solving
skills in the mathematics curriculum) into specific
test scores is a complex and time-consuming task.
The particular performance standards selected must
be validated empirically: how closely educators in
different parts of the country will concur on stand- 
ards of proficiency for children at different stages of 
schooling is not known. Standard setting has always 
been a slippery process — in employment, psycho- 
logical, or educational testing — in large part because 
of difficulties surrounding the designation of acceptable
"cutoff scores." Not surprisingly, controversy
surrounded the initial attempts to reach consensus on
standards for NAEP, with experts disagreeing among
themselves on key definitions and interpretations of
items. 

Educators and policymakers continue to debate 
whether nationwide standards are desirable, especially
if children who do not reach the defined
standards are somehow penalized. In addition to the
potential effects on children, turning NAEP into a
higher stakes test, with implicit and explicit rewards
pegged to achievement of the given proficiency
standards, could irreparably undermine
NAEP's capacity as a neutral barometer of educa- 
tional progress. 

While continued research on State-by-State 
NAEP and on standard setting will be useful, 
Congress needs to find ways to ensure that data 
from this research are reported as such and that 
the results are not prematurely construed as 
conclusive.

Chapter 1 Accountability 

Because of its scope and influence, Chapter 1
represents a powerful lever by which the Federal 
Government affects testing practices in the United 
States. OTA's analysis of Chapter 1 testing and 
evaluation requirements (see ch. 3) suggests several 
congressional policy options that could improve 
Chapter 1 accountability while reducing the overall 
testing burden in the United States. 

Chapter 1, the largest Federal program of aid to 
elementary and secondary education, provides supplementary
education services for disadvantaged
children. Over its 25-year history, Chapter 1 evalua- 
tion and assessment requirements have been revised 
many times. The result is an elaborate web of legal 
and regulatory requirements with standardized norm- 
referenced achievement tests as the basic thread. The 
tests fulfill several functions: Federal policymakers 
and program administrators use nationally aggre- 
gated scores to judge the program's overall effec- 
tiveness; and local school districts and States use 
scores to determine which schools are not making 
sufficient progress in their Chapter 1 programs, to 
place children in the program, to assess children's 
educational needs, and for other purposes. 

As a result of the 1988 amendments to Chapter 1,
which introduced the "program improvement"
concept, Chapter 1 testing became even more
critical. At the national level, there has been growing
concern that the aggregated test data (collected by
school districts with widely divergent expertise in
evaluation) do not provide an accurate and well-rounded
portrait of the program's overall effectiveness.
At the school district level, educators argue
that the test data often target the wrong schools for
program improvement or miss the schools with the
weakest programs in the district or the subject areas
and grade levels most in need of help. At the
classroom level, teachers tend to feel that their own
tests and assessments, as well as some externally
designed criterion-referenced tests, afford a much
better picture of individual students' progress than
do the norm-referenced tests.

Congress' principal challenge vis-à-vis Chapter
1 is to find ways to separate Federal evaluation
needs from State and local needs. It is a tough
dilemma: to balance the national desire for meaningful
and comparable program accountability data
against State and local needs for useful information
on which to base instructional and programmatic
decisions. Congress will consider reauthorization of 
Chapter 1 in 1993. Hearings and analysis on these 
complex questions in 1992 would provide an excel- 
lent basis for a major revision of the evaluation and 
testing requirements. 

One way to improve Chapter 1 accountability 
is to create a system that separates national 
evaluation needs from State and local information
needs. It is the perceived need for nationally
aggregated data that drives the use of norm- 
referenced tests. If Congress separated national 

evaluation purposes from State and local purposes
and articulated different requirements for each, State
Education Agencies (SEAs) and local education
authorities would be free to use a variety of
assessment methods that better reflect their own
localized Chapter 1 goals. The national data would
be used to give Federal policymakers, taxpayers, and
other interested groups a national picture of Chapter
1 effectiveness, while the State and local informa- 
tion would be used in modifying programs, placing 
students, targeting schools for program improve- 
ment, deciding on continuation of schoolwide proj- 
ects, and other purposes. 

Congress could obtain national data on Chapter 1 
through a well-constructed, periodic testing of 
Chapter 1 children, similar to the way NAEP is used 
to assess the progress of all students. This assess- 
ment would rely on sampling (rather than testing of 
every student) and could be administered less 
frequently than the current tests. In addition to
relieving the testing burden on individual students 
and reducing the time devoted to testing by teachers, 
principals, and other school personnel, this procedure
could also result in higher quality data. As the
principal client of the data, the Federal Government
could identify the areas to be assessed, instill greater
standardization and rigor in test administration and 
data analysis, and avoid the aggregation problems 
that arise from thousands of school districts admini- 
stering different instruments under divergent condi- 
tions. This type of Federal assessment could be 
designed and administered by either an independent 
body or the Department of Education, with the help 
of the Chapter 1 Technical Assistance Centers.

The system might be designed to provide a menu
of assessment options (criterion-referenced tests,
reading inventories, directed writing, portfolios, and
other performance assessments) from which States
could establish statewide evaluation criteria for 
Chapter 1 programs. If Congress preferred maxi- 
mum local flexibility, the discretion to choose 
among the assessment options could be left to school 
districts, as long as they administered the instru- 
ments uniformly and consistently across schools. 
The Chapter 1 Technical Assistance Centers could
help the States and school districts select and 
implement appropriate measures. 

Either a State or local option would increase the 
latitude for linking assessments to specific program 
goals. However, if States or districts were to select




instruments that put their Chapter 1 programs in the 
best light, the information could be misleading. 
Congress should take steps to see that this does not 
happen. For example, a strict approach would
require programs to show growth in student achieve- 
ment using multiple indicators, perhaps including 
one indicator based on a standardized test. A looser
version of this option would allow States or districts 
to develop their own evaluation methods, and set 
their own standards of acceptable progress, subject 
to Department of Education approval. 

An advantage of separating evaluation require- 
ments would likely be local development of new 
testing methods, which have not been widely used in 
Chapter 1 because of the need for national aggrega- 
tion and comparability. Congress could encourage 
this choice by reserving some of the Federal Chapter 
1 evaluation and research funding to advance the 
state of the art.

For example, competitive grants could be author- 
ized for local education agencies, SEAs, institutions 
of higher education, Technical Assistance Centers,
and other public and private nonprofit agencies to 
work on issues such as calibrating alternative 
assessments, training people to use them, bringing 
down the cost, and making them more objective. 
Congress could also consider allowing funds from 
the 5-percent local innovation set-aside to be used
for local development and experimentation. 

Since Chapter 1 is a major national influence on 
the amount, frequency, and types of standardized
testing, a broad research and development effort for 
Chapter 1 alternative assessment would have an 
impact far beyond Chapter 1. The instruments, 
procedures, and standards developed by this type of 
effort would spill over into other areas of education,
such as early childhood assessment, and would 
increase local districts' experimentation in other 
components of their educational programs. 

An important issue for congressional considera- 
tion is the appropriate grade levels for Chapter 1 
evaluations. There is considerable agreement that
testing of children in the early grades is inappropri- 
ate, especially if standardized norm-referenced paper- 
and-pencil tests are used; the 1988 reauthorization 
eliminated testing requirements for children in 
kindergarten and first grade. On the other hand, there 
are compelling arguments that from a program 
evaluation point of view it is important to have 
"pre" and "post" data, which means collecting
some baseline information. Lack of a reliable
method to demonstrate progress during the early
years could discourage principals from channeling
Chapter 1 funds to very young children, despite
evidence that early intervention is very effective. If
testing is required to show progress, these tests
should be developmentally appropriate.^45

A related congressional issue concerns the assess- 
ment of school children who have only been in a 
given school's Chapter 1 program for a short period 
of time; school districts throughout the country cite 
the high mobility of Chapter 1 children as a logistical 
obstacle to meaningful evaluation. Despite regulatory
guidance, confusion continues to reign in State
and local Chapter 1 offices about how to deal with 
a mobile student population. Clear and consistent 
policies regarding testing of these children would 
alleviate some of that confusion.

Appropriate Test Use 

The ways tests should be used and the types of 
inferences that can appropriately be drawn from
them are often not well understood by policymakers,
school administrators, teachers, or other consumers
of test information. Perhaps most important, many
parents and test takers themselves are often at a loss
to understand the reasons for testing, the importance
of the consequences, or the meaning of the results. 
School policies about how test scores will be used 
are important not only to students and parents but 
also to teachers and other school personnel whose 
own careers may be influenced by the test perform- 
ance of their pupils. Many of these problems result 
from using tests for purposes for which they are not 
designed or adequately validated. Fairness, due
process, privacy, and disclosure issues will continue
to fuel public passions around testing.

As reviewed in chapter 2, attempts to develop 
ethical and technical standards for tests and testing 
practices have a long history. The most recent 
attempt to codify standards for fair testing practice 
(in the Code of Fair Testing Practices in Education)^46
led to a set of principles with which most
professional testing groups concur.



Educational testing practices in some areas have 
been defined by Federal legislation. In the mid-1970s,
Congress passed laws with significant provisions
regarding testing, one affecting all students
and parents and the others affecting individuals with
disabilities and their parents. In both cases this
Federal legislation has had far-reaching implications 
for school policy, because Federal financial assist- 
ance to schools has been tied to mandated testing 
practices. The Family Educational Rights and Privacy
Act of 1974 (commonly called the "Buckley Amendment"
after former New York Senator James
Buckley) was enacted in part to attempt to safeguard
parents' rights and to correct some of the
improprieties in the collection and maintenance of
pupil records. The basic provisions of this legislation
established the right of parents to inspect school
records and protected the confidentiality of informa- 
tion by limiting access to school records (including 
test scores) to those who have legitimate educational 
needs for the information and by requiring parental 
written consent for the release of identifiable data.

Given the growing importance of testing and 
the precedent for Federal action, several avenues 
are open if Congress wishes to foster better 
educational testing practices and appropriate test 
use throughout the Nation. 

One option for congressional action would aim at 
improved disclosure of information. Individual
rights could be better safeguarded by encouraging 
test users (policymakers and schools) to do a careful 
job of informing test takers. Many critical decisions
about test use, such as the selection and interpreta- 
tion of tests, are made in a professional arena that is 
well-protected from open, public scrutiny. This 
occurs in part because of the highly technical nature 
of testing design. Although the professional testing 
community is not unanimous about what constitutes
good testing practice, there is considerable consen- 
sus on the importance of carefully informing indi- 
vidual test takers (and their parents or guardians in 
the case of minors) about the purpose of the test, the
uses to which it will be put, the persons who will



^45 See, e.g., Robert E. Slavin and Nancy A. Madden, Center for Research on Effective Schooling for Disadvantaged Students, The Johns Hopkins
University, "Chapter 1 Program Improvement Guidelines: Do They Reward Appropriate Practices?" paper prepared for the Office of Educational
Research and Improvement, U.S. Department of Education, December 1990. See also Nancy Kober, "The Role and Impact of Chapter 1 ESEA
Evaluation and Assessment Practices," OTA contractor report, June 1991.

^46 Joint Committee on Testing Practices, Code of Fair Testing Practices in Education (Washington, DC: National Council on Measurement in
Education, 1988).
have access to the scores, and the rights of the test
taker to retake or challenge test results.^47

Congress could require, or encourage, school 
districts to: 

• develop and publish a testing policy that spells 
out the types of tests given, how they are 
chosen, and how the tests and test scores will be 
used; and 

• notify parents of test requirements and conse- 
quences, with special emphasis on tests used 
for selection, placement, or credentialing decisions.

A second approach for Congress is to encourage 
good testing practice by modeling and demonstrating
such practice at the Federal level. The Federal
Government writes much legislation that incorpo- 
rates standardized testing as one component of a 
larger program. For example, the Individuals With 
Disabilities Education Act (Public Law 101-476), 
formerly the Education for all Handicapped Chil- 
dren Act of 1975 (Public Law 94-142), was designed 
to assure the rights of individuals with disabilities to 
the best possible education; this legislation included 
a number of explicit provisions regarding how tests
should be used to implement this program. 

Among the provisions were: 1) decisions about 
students are to be based on more than performance 
on a single test, 2) tests must be validated for the 
purpose for which they are used, 3) children must be 
assessed in all areas related to a specific or suspected 
disability, and 4) evaluations should be made by a 
multidisciplinary team.

Through these assessment provisions, Public
Laws 101-476 and 94-142 have provided a number 
of significant safeguards against the simplistic or 
capricious use of test scores in making educational
decisions. Congress could adopt similar provisions 
in other legislation that has implications for testing. 
A recent example of Federal legislation that could 
lead to questionable uses of tests is a provision in the 
1990 Omnibus Budget Reconciliation Act. The
objective of this provision is to reduce the high loan 
default rate of students attending postsecondary 
training programs (largely but not exclusively in



proprietary technical schools). The policy lever is 
testing: the act requires students without a high 
school diploma to pass an ''ability-to-benefit" test, 
on the assumption that students who are able to
benefit from postsecondary training will be more 
likely to get jobs and pay back their loans than 
students who are not able to benefit. Basic questions 
arise about the appropriateness of using existing
tests to sort individuals on this broad ''ability" 
criterion. Even the most prevalent college admis- 
sions tests do not make claims of being able to 
predict which students will "benefit" in the long 
run, but rather which students will do well in their
freshman year. 

A third course of action would focus on various 
proposals to certify, regulate, oversee, or audit tests. 
If Congress wants to play a more forceful role in 
preventing misuse of tests (in particular, preventing
tests designed for classroom use or system monitoring
from being applied to individual selection or
certification decisions), this option is the clear
choice. If testing continues to increase and takes on 
even more consequences, pressure for congressional 
intervention will grow. Proposals include Federal 
guidelines for educational test use, labeling of all 
mandated tests and test requirements, labeling of all 
commercially available tests, and creating a governmental
or quasi-governmental entity to regulate,
certify, and disseminate information about tests.
This last option, which echoes a concept endorsed by
the National Commission on Testing and Public
Policy, has been discussed in testing policy circles
for some years now.^48

Finally, Congress could pursue more indirect 
ways to inform and educate consumers and users of 
tests. This might include supporting continuing 
professional education for teachers and administra- 
tors, or funding the development of better ways to 
analyze test data and convey the results more 
effectively to the public. 

Federal Research and Development Options 

Test development is a costly process. Even for a
test or test battery that has already been in use for 
many years, it can take from 6 to 8 years to write new 
^47 See, for example, American Psychological Association, Standards for Educational and Psychological Testing (Washington, DC: 1985); Joint
Committee on Testing Practices, op. cit., footnote 46; and Russell Sage Foundation, Guidelines for the Collection, Maintenance, and Dissemination of
Pupil Records (New York, NY: 1969), especially Guideline 1.3.

^48 See, e.g., D. Goslin, "The Present and Future of Assessment Ibw
sponsored by the U.S. Department of Education, Mar. 23-25, 1990, draft dated July 19, 1990.
items, pilot test, and validate a major revision.^49
Most investigators working on new testing designs
are wading into uncharted statistical and methodological
waters. For a new test, consisting of open-ended
performance tasks or other innovative items,
development and validation are substantially more 
expensive, even if test content and objectives are 
clearly defined. For example, the development of a
set of new performance measures assessing specific 
job-related skills for the armed services cost $30 
million over 10 years. The results of this sustained 
research effort, coordinated by the Department of 
Defense and carried out by the individual service 
organizations, were a set of hands-on measures, new 
supervisory ratings, job-knowledge tests, and com- 
puter-based simulations representing the skills re- 
quired in some 30 well-defined jobs. The main 
purpose of the research was to improve the outcome 
or criterion measures used to validate the Armed 
Services Vocational Aptitude Battery, the standard- 
ized test used to qualify new recruits for various job 
assignments.^50

In elementary and secondary school testing,
however, the first step — defining the content that 
tests should cover — is much more complex than 
defining specific job performance outcomes for a 
number of jobs. The omnipresent issue of achieving 
consensus on content poses formidable barriers to 
test design. Even in a subject like mathematics, for 
which there is some agreement on outcomes and 
standards (as exemplified by the National Council 
of Teachers of Mathematics' recent work on standards
for mathematics education), the definition of
those standards took 6 years to develop. In most 
other subjects consensus on goals and curricula is 
more difficult to reach, adding substantially to 
research and development (R&D) costs. Moreover, 
separate standards, content, and tests would need to 
be developed for each grade level and subject to be 
tested. 

Another factor making testing R&D expensive is 
the question of how new assessment methods will 
affect students and teachers. Much of the interest in 
developing new assessments (see ch. 6) stems from 



the desire to see those assessments eventually 
become the basis for system monitoring and other
high-stakes decisions. Validation studies are therefore
critical. Random assignment experiments, which
are costly, could encounter legal barriers because
students' lives and educational experiences could be
affected. Validation studies, therefore, may need to
be conducted with quasi-experimental designs, which 
suffer from various statistical and methodological 
problems.^51

Congress has an important role to play in 
supporting R&D in educational testing, because 
adequate funding cannot be expected from other 
sources. Commercial vendors are not likely to make 
the requisite investments without some assurance of 
a reasonable return; they face strong market incen- 
tives to sell generic products that match the curricula 
of many school systems. But if these products are so 
general in their coverage that they reflect only a 
limited subset of skills common to virtually all 
curricula, schools may not see the advantage of 
adding them to an already strapped instructional 
materials budget. States might be willing to foot the
R&D bill, although their education budgets are
generally quite constrained. Moreover, in addition to 
costs associated with consensus-building on test 
content and evaluation of the anticipated effects of 
testing, new performance assessment and/or com- 
puter-based methods require basic research on 
learning and cognition. Basic education research has 
traditionally been a Federal responsibility. 

The question becomes how much: how much 
should the Federal Government spend on educational
testing R&D? The answer depends on the
choice Congress makes regarding the value of 
dramatically enlarging the currently available range 
of testing methods. For example, Federal spending
on educational assessment research is roughly $7
million for fiscal year 1992, out of a total education
research budget of close to $100 million.^52 This
money is divided almost evenly among NAEP (for
validation studies, evaluation of trial State assess- 
ment, and secondary data analysis); development of 
new mathematics and science assessments ($6 



^49 Rudman, op. cit., footnote 8, p. 8.

^50 See Alexandra Wigdor and Bert Green (eds.), Performance Assessment for the Workplace, vol. 1 (Washington, DC: National Academy Press, 1991).

^51 See, e.g., Anand Desai, "Technical Issues in Measuring Scholastic Improvement Due to Compensatory Education Programs," Socio-Economic
Planning Sciences, vol. 24, No. 2, 1990, pp. 143-153.

^52 Education research and statistics spending in fiscal year 1990 was $94 million. See U.S. Department of Education, Digest of Educational Statistics,
op. cit., footnote 1, p. 344.
million over 3 years, administered through the
National Science Foundation); and general assess- 
ment research (through the Center for Research on 
Evaluation, Standards, and Student Testing).

Substantially more funding would be needed if 
Congress chooses to support: 

• cognitive science research on learning and 
testing, 

• development of new approaches to consensus
building for test content and objectives, 

• research on the generalizability of new testing
methods across subjects and grades, and 

• validation studies of new testing methods. 



An intermediate funding approach would be to
target Federal dollars toward:

• the creation of a clearinghouse to facilitate 
continuing and more widespread dissemination 
of testing research results and innovations, 

• continuing professional education for teachers
in the applications of new testing and assessment
methods and in the appropriate interpretations
and uses of test results, and

• the creation of a nationwide computer-based 
clearinghouse of test items from which States 
and local districts could draw to develop their 
own customized tests.
CHAPTER 2

Testing in Transition 



Highlights 

• Since the 1960s testing in elementary and secondary schools has been caught in a tug between two 
powerful forces: increased public attention to test scores because of demands for evidence that the
schools are educating children, and increased demands from educators and students for tests that more 
accurately reflect changing educational goals, new curricula, and reforms in teaching. 

• State-level concerns about the quality of education were the dominant force behind the rise of 
high-stakes testing beginning in the mid-1970s. Minimum competency testing, for example, was
embraced by many State policymakers who believed that the imposition of external standards would
boost educational quality. Since then, however, studies of the effects of this testing have led most 
educators to question the utility of tests as an instrument of reform. 

• Two decades of research about learning and cognition have produced important findings about how
children learn and acquire knowledge. These findings challenge most traditional models of classroom
organization, curricula, and teaching methods. Among the most important findings are that teaching
thinking skills need not await mastery of so-called "basic" skills, and that all students are capable of
learning thinking skills. Many educators now charge that significant changes in classrooms cannot go
forward if traditional tests are to remain the primary indicator of achievement and program success. 
The tests must change, they argue, if schools are to change. 

• Many of the recent challenges to traditional tests have been directed at the norm-referenced
multiple-choice tests most often used to assess educational achievement. It is not just the tests
themselves that create controversy, however. Testing practices (the ways tests are used and the types
of inferences drawn from them) also create many of the problems associated with testing. Appropriate
testing practices are difficult to enforce and few safeguards exist to prevent misuse and
misinterpretation of scores, especially once they reach the public.

• Test-use policy is important not only to students and parents but also to teachers and other school
personnel whose own careers may be influenced by the test performance of their pupils. Concern for 
the increasing consequences being attached to test scores has helped fuel a backlash against 
standardized testing that had been brewing since the expansion of high-stakes testing in the 1970s,
when issues of fairness, test bias, due process, individual privacy, and disclosure were debated in 
Congress and the courts. 

• Although demands for accountability have not abated amid this environment of testing reform, most 
educators now urge the development and implementation of new testing and assessment technologies, 
and all caution against the use of tests as the sole or principal indicator of achievement. 



Overview 

Two decades of discussion about school quality 
have convinced many Americans that their educa- 
tional system needs substantial reform to meet the 
demands of the next century. Although the country 
is far from consensus about exactly what types of 
reform are needed, nearly all the initiatives call for 
changes in educational testing. 

Some school reformers, primarily at the State
level, have called for changes in testing to monitor
student progress in mastering fundamentally new



curricula. Others have pinned their hopes on more 
high-stakes testing (including yet-to-be-developed
national tests) to spur greater student and teacher
diligence. This group includes educators and policy- 
makers who believe that new and better tests can 
lead to improved learning, as well as those who 
believe in conventional tests as a catalyst of change. 
Still others fear that more testing of any type will 
only exacerbate the problems of test misuse and 
unfairness, and will be counterproductive to school 
reform. These debates should not surprise anyone 
familiar with the U.S. education system: standardized
tests have always been prominent, and discussion
of educational reform inevitably involves an
examination of testing. 

Since the 1960s Americans have turned increas- 
ingly to testing as a tool for measuring student 
learning, holding schools accountable for results, 
and reforming curriculum and instruction. Testing in
elementary and secondary schools has, therefore, 
increased in both frequency and significance. As 
shown in figure 2-1, revenues from sales of commercially
published standardized tests for K-12 more
than doubled between 1960 and 1989; i.e., from
about $40 million in 1960 to about $100 million in
1989 (in constant 1982 dollars). A recent report of
the National Commission on Testing and Public
Policy estimates that the 44 million American
elementary and secondary students take 127 million
separate tests annually, as part of standardized test
batteries mandated by States and districts.^2

Much of this growth in testing occurred during a 
period of economic, social, and demographic turbu- 
lence, and is attributable to Federal, State, and local 
demands for increased accountability.^3 These strategies
for change, such as performance reporting,
establishing and enforcing procedural standards, and 
changing school structure or the professional roles of 
school personnel, rely on test information about 
schools and students.^4

At the Federal level, demands for test-based 
accountability emerged as a consequence of substan- 
tial new financial commitments to education on the 
part of the Federal Government. State-mandated
tests, often designed and administered by State 
authorities (rather than by commercial vendors) 
have also grown dramatically; State-level concern 
with the quality of education, and State-level de- 
mands for improvement in the outcomes of school- 
ing, have perhaps been the dominant forces behind 



Figure 2-l~Revonuo8 From Sales of Commercially 
Produced Standardized Tests in the United States, 
1960-90 
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the rise of standardized testing in the past two 
decades. 

The pattern of increased testing followed by 
increased controversy dates to the initial uses of tests 
to stimulate school reform in the 19th century.^4 In
different periods the specific causes of controversy 
over testing have varied. Today the debate stems 
from three main factors. 

First, many of the people and school systems
attempting to redesign curricula and reform teaching
and learning feel stymied in their efforts by tests that
do not reflect new education goals. Moreover,
because tests have increasingly high stakes, reformers
find that bold new ideas of curricula and
instruction cannot surmount the power of tests to
reinforce traditional learning. For example, the basic
"building-block" approach to student learning (the
idea that children needed to be solidly grounded
in the basics before acquiring advanced thinking and
problem-solving skills) has been gradually supplanted
by new research findings. Curriculum specialists
as well as teachers have begun arguing for
new approaches to the definition and instruction of
"higher order skills," and for changes that could
make tests better indicators of learning.

^1 National Commission on Testing and Public Policy, From Gatekeeper to Gateway: Transforming Testing in America (Boston, MA: 1990), p. 15. Test publishers claim that the National Commission exaggerates in its estimate of testing. For example, the Vice President for Publishing at one of the largest educational test publishing companies argues that: "Our data sources indicate that roughly 30 to 40 million standardized tests are administered annually across the country . . . [at an annual] total cost of . . . $100 million to $150 million." See Douglas MacRae, "Topic: Too Much Testing?" From CTB Publisher's Desk, No. 3, Nov. 15, 1990.

^2 The work of Leon Lessinger, "Accountability for Results," American Education (Washington, DC: U.S. Office of Education, June-July 1969), is often credited with igniting the most recent wave of accountability in education. For a synthesis and discussion of approaches to accountability in education see Michael Kirst, Accountability: Implications for State and Local Policymakers (Washington, DC: U.S. Department of Education, July 1990).

^3 "An aroused parent group, for example, will follow up on the results of a negative school report card by lobbying the school board for a new principal." Kirst, op. cit., footnote 2, p. 7.

Second, the demand for test-based accountability 
continues to grow. Advocates of test-based account- 
ability argue it is an efficient and effective way to 
make students, teachers, and schools work harder. 
Some go so far as to suggest that raising the stakes 
of these tests can put America back on the road to 
global economic hegemony: since teachers will 
teach and children will study what is tested, the 
thinking goes, then the tests themselves can drive 
educational reform.^5 Opponents of this view charge
that high-stakes testing sends the wrong signals to
students and teachers, and encourages emphasis on
test taking and test preparation rather than genuine
learning. They also argue that attaching high stakes
to tests threatens the validity of the information 
provided by the tests and leads to erroneous policy 
inferences. 

Third, as the tension surrounding tests increases, 
so do concerns about the appropriate use of tests and 
the effects of tests on individual rights. The history 
of testing is littered with examples of tests being 
used in ways not intended by their developers, 
tempting policymakers and the public to draw 
inferences not supportable by test data. 

The three camps (those who support new approaches
to assessment and testing, those who think
more high-stakes testing will improve education,
and those who are worried about ethical and legal
aspects of testing) share a common concern for
raising the quality of American schooling. But their 
strategies are crafted from visions of the educational 
system and the nature of human learning glimpsed 
through very different prisms. 



Changing Views of Teaching 
and Learning 

A quiet but dramatic transformation is occurring 
in education as researchers and practitioners rethink 
basic beliefs about teaching and learning. Two 
decades of research from developmental and cogni- 
tive psychology have produced important findings 
about how children learn and acquire knowledge.^6
The basic concept in this research is that children are 
active builders of their own knowledge, not merely 
passive receptacles for information. These research 
findings and the instructional theories they have 
spawned raise serious challenges to traditional 
classroom organizational models, to conventional 
curricula, and, in turn, to existing forms of testing. 
Moreover, they have rekindled an awareness of the 
close links between instructional goals and assess- 
ment. 

Evolving Views of Learning 

In their teaching methods, curricular materials,
and testing methods, many schools today embody a
behaviorist model of learning first popularized in the
1920s. In this model:

. . . learning is seen to be linear and sequential.
Complex understanding can only occur by the
accretion of elemental, prerequisite learnings. . . .
The whole idea was to break desired learnings into
constituent elements and teach these one by one. . . .
The implications of this model for instruction are
conveyed best by . . . [the] metaphor of a brick wall,
i.e., it is not possible to lay the bricks in the fifth layer
until the first, second, third, and fourth layers are
complete.^7

This model assumes that more complex skills can be 
broken down into simple skills, each of which can be 
mastered independently and out of context. When all 
requisite components are mastered, then more com- 
plex thinking skills can accrue. According to this 
view, the highest levels of knowledge are achieved 
only at the later grades and, even then, only by some 
students. In this conventional model, moreover, the 
teacher is the active partner in the educational
process, imparting knowledge to a passive student as
though filling an empty jug.

^5 See, e.g., Robert Samuelson, "The School Reform Fraud," The Washington Post, June 19, 1991, p. A19.

^6 The following discussion about constructivist and behaviorist models of learning draws on Lauren B. Resnick and Daniel P. Resnick, "Assessing the Thinking Curriculum: New Tools for Educational Reform," paper prepared for the National Commission on Testing and Public Policy, August 1989; Lorrie A. Shepard, University of Colorado at Boulder, "Psychometricians' Beliefs About Learning," paper presented at the annual meeting of the American Educational Research Association, Boston, MA, Apr. 17, 1990.

^7 Shepard, op. cit., footnote 6, p. 15.

This hierarchical view of complex thinking is 
challenged by recent research from the cognitive 
sciences. 

One of the most important findings of recent
research on thinking is that the kinds of mental 
processes associated with thinking are not restricted 
to an advanced or 'higher order' stage of mental 
development. Instead, thinking and reasoning are 
intimately involved in successfully learning even 
elementary levels of reading, mathematics, and other 
school subjects. Cognitive research on children's 
learning of basic skills reveals that reading, writing, 
and arithmetic — the three Rs — involve important 
components of inference, judgment and active mental
construction [see box 2-A]. The traditional view
that the basics can be taught as routine skills, with
thinking and reasoning to follow later, can no longer
guide our educational practice.^8

In fact, the term "higher order" thinking skills
seems something of a misnomer in that it implies
that there is another set of "lower order" skills that
need to come first.

Another implication of the hierarchical "brick
wall" model of learning is the notion that slower
learners need to master low-level skills before they
can move on to more complex skills. This sort of
thinking underlies many compensatory education
programs, in which educationally disadvantaged
children or children who learn more slowly than
their peers spend much of their time confined to
remedial classes consisting of drill and practice. By
a process of remediation through repetition students
are expected to master the low-level skills; many,
however, spend a good portion (if not all) of their
educational careers confined to the mastery of basic
skills through remedial methods. The constructivist
model of learning indicates that these students are
capable of much more than this; this research
suggests that all are naturally engaged every day in
problem solving, making inferences and judgments,
and forming theories about how the world works.

Photo credit: Siemens Corp.

Recent research has emphasized that learning is an
active process that can best be supported in the
classroom by hands-on activities and experimentation.
As curricula and teaching practices change,
new tests will also be needed.

Several programs designed specifically to focus
on increasing the achievement of disadvantaged
learners provide evidence to support the notion that
these students are capable of learning far more than
basic skills. The Accelerated Schools Program is a
reform experiment designed to accelerate the learning
of at-risk students and close the "achievement
gap" while the students are still in elementary
school. The program sets high expectations for
student learning and focuses on the teaching of
critical thinking and problem solving to all students.
Although these programs do not yet have a long
track record, teachers report delight and surprise at
the gains achieved by participating students.^9 Another
program, the Higher Order Thinking Skills
(HOTS) project, provides Chapter 1 students in
grades four through seven with enhanced thinking
skills instead of remediation. The HOTS project has
yielded compelling anecdotal evidence of substantial
gains in self esteem and enthusiasm for learning,
as well as achievement test scores, when children
participate in the program for 35 minutes a day over
2 school years.^10

^8 Resnick and Resnick, op. cit., footnote 6, p. 2.

^9 Gail Meister, Research for Better Schools, "Assessment in Programs for Disadvantaged Students: Lessons From Accelerated Schools," OTA contractor report, April 1991.

Box 2-A--Fourth Grade Scientists Test a Theory^1

For nine winters, experience had been their teacher. Every hat they had worn, every sweater they had donned,
contained heat. "Put on your warm clothes," parents and teachers had told them. So when the children in Ms.
O'Brien's fourth grade science class began to study heat one spring day, who could blame them for thinking as they
did?

"Sweaters are hot," said Katie.

"If you put a thermometer inside a hat, would it ever get hot? Ninety degrees, maybe," said Neil.

. . . [With O'Brien's help, the students set out to test these theories.] Christian, Neil, Katie, and the others placed
thermometers inside sweaters, hats, and a rolled-up rug. When the temperature inside refused to rise after 15 minutes,
Christian suggested that they leave the thermometers overnight. After all, he said, when the doctor takes your
temperature, you have to leave the thermometer in your mouth for a long time. Folding the sweaters and hats securely,
the children predicted three digit temperatures the next day.

When they ran to their experiments first thing the next morning, the children were baffled. They had been wrong.
Now they'll change their minds, and we can move on, O'Brien thought.

But . . . the children refused to give up. "We just didn't leave them in there long enough," Christian said. "Cold
air got in there somehow," said Katie.

. . . [O'Brien suggested they adjust their experiments and try again.] If, as they insisted, cold air had seeped inside
the clothes overnight, what could they do to keep it out? . . . Neil decided to seal the hat, with the thermometer inside,
in a plastic bag. Katie chose to plug the ends of the rug with hats. Others placed sweaters in closets or in desks, far
away from the great gusts of cold air they seemed to think swept their classroom at night.

. . . On Wednesday morning the children rushed to examine their experiments. They checked their deeply buried
thermometers. From across the room, they shared their bewilderment. All the thermometers were at 68 degrees
Fahrenheit. Confused, they wrote in their journals. "Hot and cold are sometimes strange," Katie wrote. "Maybe [the
thermometer] didn't work because it was used to room temperature."

Meanwhile, O'Brien wondered in her own journal . . . how long she should let these naive conceptions linger.
[She decided to have the students proceed with] . . . one more round of testing. And so the sweaters, hats, and even
a down sleeping bag brought from home were sealed, plugged, and left to endure the cold.

. . . For the third day in a row in O'Brien's classroom, the children rushed to their experiments as soon as they
arrived. The sweater, the sleeping bag, and the hat were unwrapped. Once again the thermometers uniformly read
room temperature. O'Brien led the disappointed children to their journals. But after a few moments of discussion, she
realized that her students had reached an impasse. Their old theory was clearly on the ropes, but they had no new theory
with which to replace it. She decided to offer them a choice of two possible statements.

"Choose statement A or B," she told them. The first stated that heat could come from almost anything, hats and
sweaters included. In measuring such heat, statement A proclaimed, we are sometimes fooled because we're really
measuring cold air that gets inside. This, of course, was what most children had believed at the outset. Statement B,
of O'Brien's own devising, posed the alternative that heat comes mostly from the sun and our bodies and is trapped
inside winter clothes that keep our body heat in and keep the cold air out.

"Write down what you believe," O'Brien told the class. [Although some students clung to the "hot hat" theory
and some did not know what to think, most chose theory B.]

"How can we test this new theory?" O'Brien asked. Immediately Neil said, "Put the thermometers in our hats
when we're wearing them." And so the children went out to recess that day with an experiment under their hats.

As Deb O'Brien relaxed during recess, she asked herself about the past three days. Had the children really
changed their minds? Or had they simply been following the leader? Could they really change their ideas in the course
of a few class periods? Would any of their activities help them pass the standardized science test coming up in May?
O'Brien wasn't sure she could answer any of these questions affirmatively. But she had seen the faces of young
scientists as they ran to their experiments, wrote about their findings, spoke out, thought, asked questions -- and that
was enough for now.

^1 Excerpted from Bruce Watson and Richard Konicek, "Teaching for Conceptual Change: Confronting Children's Experience," Phi Delta
Kappan, vol. 71, No. 9, May 1990, pp. 680-685.

Additional evidence suggests that thinking and 
reasoning skills can be taught.^11 A number of
programs have been designed to teach thinking and 
problem-solving skills; some focus on developing 
these skills within particular disciplines (e.g., math- 
ematics and reading) while others are aimed at 
enhancing general thinking skills that would, pre- 
sumably, be applicable in many different settings. 
The effectiveness of these programs is difficult to 
evaluate in the absence of appropriate outcome 
measures. Evaluations show students improving on 
measures tied to the material taught: students appear 
to learn to do the things the program teaches. The 
question of whether that learning generalizes is more 
difficult to assess, in part because there are few good 
outcome measures for these skills.^12

The results of these studies suggest some hopeful 
beginnings for the design of curricula and teaching 
methods focused on thinking and reasoning skills. 
Much of this work is new and experimental. 
Experimentation is needed to discern how much 
emphasis to place on general thinking skills and how 
much to emphasize thinking skills for specific 
knowledge and information. Moreover, knowledge
of how to teach those reasoning skills (at what ages,
using what methods) is still very rudimentary.

In sum, although educators have always attempted
to foster reasoning skills, research about
learning and the structure of knowledge suggests 
two major changes in how those skills should be 
taught. First, thinking skills need not be learned 
only after other, more basic skills are mastered. 
Second, all students are capable of learning 
thinking skills. 



Evolving Views of the Classroom 

Recent developments in education have con- 
verged to make more and more classrooms into vital 
laboratories for new teaching and learning methods. 
First, the growing presence of educational technol- 
ogy in the classroom, especially computers and 
integrated learning systems, is changing the defini- 
tions of what children need to know and how to teach 
it. 

Second, educators are radically rethinking the 
structure and content of their disciplines. For example,
the National Council of Teachers of Mathematics
(NCTM) has proposed fundamental changes in
the content and delivery of elementary and secondary
school mathematics instruction, changes that
emphasize the use of manipulative objects and the
teaching of analytical reasoning and problem-solving
skills. Mathematics educators have recognized
that: ". . . the world is changing so rapidly
that, unless those involved in mathematics education
adopt a proactive view and develop a new assessment
model for the twenty-first century, the mathematical
understanding of children will continue to be
inadequate into the future;"^13 and they have worked
to build consensus on a set of curriculum standards
for K-12 education. Initiatives to revisit science
curricula and teaching methods have also taken hold,
with particular efforts to stress "hands-on" science
experiments. In addition, many schools are experimenting
with the idea of the "integrated curriculum,"
in which central themes or ideas are taught
across disciplines and the school day is no longer 
divided into discrete periods labeled by subject. 

Third, attention is being directed toward the
development of materials and methods for cultivating
higher order thinking skills (see box 2-B). The
emphasis on fostering reasoning skills has been
bolstered by the widespread recognition that changing
economic and technological conditions will
require upgrading the cognitive skills of the work
force.^14 The combined effects of research on learning
and public concern for the state of the education
system have led some educators to suggest that
reasoning should be considered as "the fourth R."^15
In classrooms across the country, teachers are
experimenting with ways to teach critical thinking
and comprehension along with basic skills and
information.

^10 S. Pogrow, "Challenging At-Risk Students: Findings From the HOTS Program," Phi Delta Kappan, vol. 71, No. 5, January 1990, pp. 389-397.

^11 For descriptions of some of these efforts see R. Glaser, "Education and Thinking: The Role of Knowledge," American Psychologist, vol. 39, No. 2, February 1984, pp. 93-104; Lauren B. Resnick, Education and Learning to Think (Washington, DC: National Academy Press, 1987); Lauren B. Resnick and Leopold E. Klopfer (eds.), Toward the Thinking Curriculum: Current Cognitive Research, 1989 Yearbook of the Association for Supervision and Curriculum Development (Alexandria, VA: Association for Supervision and Curriculum Development, 1989); and Norman Frederiksen, "Implications of Cognitive Theory for Instruction in Problem Solving," Review of Educational Research, vol. 54, No. 3, fall 1984, pp. 363-407.

^12 Resnick, op. cit., footnote 11.

^13 Thomas A. Romberg, E. Anne Zarinnia, and Kevin F. Collis, "A New Worldview of Assessment in Mathematics," Assessing Higher Order Thinking in Mathematics, G. Kulm (ed.) (Washington, DC: American Association for the Advancement of Science, 1990), p. 21.

Box 2-B--Thinking About Thinking Skills

What are "higher order thinking skills"? What do they look like and how do we know . . .
have them? The first truism seems to be that they are difficult to define; the second is that they are even harder
to measure.

Social scientists from many disciplines have studied mental processes such as thinking, problem solving,
reasoning, and critical thinking: although they have produced many carefully wrought definitions, consensus
about the nature of these processes has eluded them. Educational practitioners, on the other hand, have less
interest in understanding the precise nature of all possible thinking processes; instead, practitioners are most
concerned about the ". . . complex thought processes required to solve problems and make decisions in
everyday life, and those that have a direct relevance to instruction."^1 One recent attempt to synthesize the
perspectives of philosophers, psychologists, and educators has produced the outline of thinking skills shown
in table 2-B1. As this table suggests, at least some consensus exists about the kinds of skills educators would
like to include in a thinking curriculum.

^1 J.A. Arter and J.R. Salmon, Northwest Regional Educational Laboratory, "Assessing Higher Order Thinking Skills: A
Consumer's Guide," unpublished report, April 1987, pp. 1-2.

Table 2-B1--List of Thinking and Reasoning Skills

I. Problem solving
   A. Identifying general problem
   B. Clarifying problem
   C. Formulating hypothesis
   D. Formulating appropriate questions
   E. Generating related ideas
   F. Formulating alternative solutions
   G. Choosing best solution
   H. Applying the solution
   I. Monitoring acceptance of the solution
   J. Drawing conclusions

II. Decisionmaking
   A. Stating desired goal/condition
   B. Stating obstacles to goal/condition
   C. Identifying alternatives
   D. Examining alternatives
   E. Ranking alternatives
   F. Choosing best alternative
   G. Evaluating actions

III.
   A. Inductive thinking skills
      1. Determining cause and effect
      2. Analyzing open-ended problems
      3. Reasoning by analogy
      4. Making inferences
      5. Determining relevant information
      6. Recognizing relationships
      7. Solving insight problems
   B. Deductive thinking skills
      1. Using logic
      2. Spotting contradictory statements
      3. Analyzing syllogisms
      4. Solving spatial problems

IV. Divergent thinking skills
   A. Listing attributes of objects/situations
   B. Generating multiple ideas (fluency)
   C. Generating different ideas (flexibility)
   D. Generating unique ideas (originality)
   E. Generating detailed ideas (elaboration)
   F. Synthesizing information

V. Evaluative thinking skills
   A. Distinguishing between facts and opinions
   B. Judging credibility of a source
   C. Observing and judging observation reports
   D. Identifying central issues and problems
   E. Recognizing underlying assumptions
   F. Detecting bias, stereotypes, cliches
   G. Recognizing loaded language
   H. Evaluating hypotheses
   I. Classifying data
   J. Predicting consequences
   K. Demonstrating sequential synthesis of information
   L. Planning alternative strategies
   M. Recognizing inconsistencies in information
   N. Identifying stated and unstated reasons
   O. Comparing similarities and differences
   P. Evaluating arguments

VI. Philosophy and reasoning
   A. Using dialogical/dialectical approaches

NOTE: This list is based on a compilation and distillation of ideas from . . . sources.

SOURCE: J.A. Arter and J.R. Salmon, Northwest Regional Educational Laboratory, "Assessing Higher-Order Thinking
Skills: A Consumer's Guide," unpublished report, April 1987, p. 3.

Implications for Standardized Testing 

Educators trying to implement these new ideas 
and classroom practices have found themselves face
to face with the dominance of standardized norm-referenced
tests as the sine qua non of educational
effectiveness. Many have found their new programs
being judged by tests that do not cover the skills and
goals central to their innovations. Those working on
integrated curricula, a new vision of mathematics, or
hands-on learning environments have found their
new programs measured by tests designed for very
different goals. Thus, a new and energetic movement
has emerged focused on developing assessments
more closely aligned with new curricula, learning
methods, and valued skills. 

The press for reform of tests to better match 
instruction and curricula comes from many sources. 
Educators are recognizing the potential of computers 
to change testing just as they are changing learning.
Curriculum reform groups, such as the NCTM
standards committee, are seeking assessments better
matched to their curricular and evaluation standards.
Educators working to increase the achievement of
disadvantaged learners express frustration that many
of their critical program goals are not measured by
existing standardized tests.^16 A common theme is
that transformation of education cannot occur as
long as tests embrace obsolete concepts about
learning. Without new assessment instruments, it is
difficult to ascertain whether reforms in instruction
and curriculum are working.

What implications does a focus on thinking skills 
and active learning have for test design? Reformers
trying to implement a thinking curriculum agree on
the need for changes that will better focus on
reasoning skills and deep understandings. Test
designers have always advanced the idea that an
achievement test should be designed to reflect the
goals of the curriculum. Most current achievement
tests were constructed by careful delineation of the
subject matter (e.g., reading, language arts, and
mathematics); experts in the subject matter areas
were largely responsible for specifying the domains
of information and the skills to be mastered.
However, ". . . a clear definition of the subject-matter
content is essential, but insufficient by itself.
An understanding of the learner's cognitive processes
-- the ways in which knowledge is represented,
reorganized, and used to process new information --
is also needed."^17

^14 Although most analysts agree that some improvement in thinking skills will be beneficial, there is disagreement over how high to raise the threshold. The disagreement stems from conflicting interpretations of data on the productivity of the work force currently and on the effects of technological change on future skill requirements. For an eloquent discussion, see Richard Murnane, "Education and the Productivity of the Work Force: Looking Ahead," American Living Standards, R. Litan, R. Lawrence, and C. Schultze (eds.) (Washington, DC: Brookings Institution, 1988), pp. 215-245.

^15 . . . Ability to Reason," paper presented to the Federation of Behavioral, Psychological and Cognitive Sciences Science and Public Policy Seminar, June 1989; and Larry Cuban, "Policy and Research Dilemmas in the Teaching of Reasoning: Unplanned Designs," Review of Educational Research, vol. 54, 1984, pp. 655-681.

^16 See Meister, op. cit., footnote 9.

^17 . . . Test Design," The Redesign of Testing for the 1985 ETS Invitational Conference, Eileen E. Freeman (ed.) (Princeton, NJ: Educational Testing Service, 1986), p. 73.

^18 B.S. Bloom (ed.), Taxonomy of Educational Objectives: The Classification of Educational Goals, Handbook 1: Cognitive Domain (New York, NY: Academic Press, 1956). This discussion of the applications of Bloom's taxonomy to achievement testing is drawn from Romberg et al., op. cit., footnote 13. See also Edward Haertel and Robert Calfee, "School Achievement: Thinking About What to Test," Journal of Educational Measurement, vol. 20, summer 1983, pp. 119-132.

Until recently most attempts to incorporate cognitive
skills into test design were modeled on Bloom's
taxonomy of cognitive behaviors,^18 which attempts
to organize and classify the cognitive skills children
are supposed to acquire. The taxonomy reflects a
behavioral approach to learning; educational objectives
are written as clearly delineated, mutually
exclusive categories of behavior that can be observed,
counted, and classified. Tests based on this
taxonomy are organized according to a content-by-behavior
matrix. As the example in figure 2-2
demonstrates, one axis of the matrix lists the content
areas and the other axis describes the skills test
takers are expected to demonstrate within each
content area (in this example, computation, comprehension,
application, and analysis). Items are designed
for each cell in the matrix. Despite changes
over time in the specifics of each axis, the matrix
approach to test design has persisted because ". . . it
permits a rapid overview of the entire structure [of
a test] and relative emphasis on one part or
another."^19
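
The "rapid overview" and "relative emphasis" that a content-by-behavior matrix provides can be made concrete with a small sketch. The cell counts below are those of the 60-item mathematics example in figure 2-2. The function names are hypothetical, and the label `Number systems` for the first content area is an assumption. This illustrates the bookkeeping only; it is not a description of how any test publisher builds a blueprint.

```python
# Content-by-behavior blueprint: each row is a behavior, each column a
# content area, each cell the number of test items planned for that pair.
CONTENT_AREAS = ["Number systems", "Geometry", "Algebra"]  # first label assumed
BLUEPRINT = {
    "Computation": [15, 8, 7],
    "Comprehension": [5, 5, 5],
    "Application": [5, 3, 2],
    "Analysis": [0, 4, 1],
}


def behavior_totals(blueprint):
    """Return the number of items planned for each behavior (row totals)."""
    return {behavior: sum(cells) for behavior, cells in blueprint.items()}


def content_totals(blueprint, areas):
    """Return the number of items planned for each content area (column totals)."""
    return {
        area: sum(cells[i] for cells in blueprint.values())
        for i, area in enumerate(areas)
    }


def emphasis(blueprint):
    """Return each behavior's share of all items on the test."""
    totals = behavior_totals(blueprint)
    n_items = sum(totals.values())
    return {behavior: count / n_items for behavior, count in totals.items()}


if __name__ == "__main__":
    print(behavior_totals(BLUEPRINT))
    print(content_totals(BLUEPRINT, CONTENT_AREAS))
    for behavior, share in emphasis(BLUEPRINT).items():
        print(f"{behavior}: {share:.0%} of items")
```

In this example half of the 60 items call for computation and only 5 call for analysis, which is the kind of relative emphasis the matrix makes visible at a glance.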

Some critics of the taxonomic approach feel that 
the matrix oversimplifies the complexity of knowl- 
edge and how students acquire it. Subject matter 
experts from various content disciplines have criti- 
cized the way that such matrices artificially divide 
both content and skills into mutually exclusive 
categories, ignoring complex interrelationships. In 
fact, the matrix form, by its very nature, suggests
"relationships which are simple, numerically
restricted and linear . . ."^20, an outmoded concept
that views thinking skills as hierarchically nested
atop one another, with the learner moving from
simple thinking skills to more complex ones as
achievement advances. 

Cognitive Research: Implications for 
New Test Design 

Since the publication of Bloom's taxonomy, 
considerable research has been conducted about the 
nature of the cognitive processes involved in learning.
The findings from cognitive sciences research
provide a basis for different kinds of instruction, 
curriculum materials, and tests that more closely 
resemble the processes involved in learning and 
thinking (see box 2-C). Findings from research on 
learning and cognition imply at least three broad 
changes for educational tests: 

1. Knowledge is a complex network of informa- 
tion and skills, not a series of isolated skills 
and facts. Tests designed to assess knowledge
must reflect this complexity both in the tasks
they require children to complete and the
criteria they use to evaluate a child's knowledge.



Figure 2-2--Example of Content-by-Behavior Matrix
for a 60-Item Mathematics Test

                          Content areas
Behavior         Number systems   Geometry   Algebra   Total
Computation            15             8          7       30
Comprehension           5             5          5       15
Application             5             3          2       10
Analysis                0             4          1        5
Total                  25            20         15       60

NOTE: The values in the cells represent the number of items on the test.
Matrices like this are used in planning and designing tests.

SOURCE: Office of Technology Assessment, 1992. Based on a concept
discussed in Thomas A. Romberg, E. Anne Zarinnia, and Kevin
F. Collis, "A New Worldview of Assessment in Mathematics,"
Assessing Higher Order Thinking in Mathematics, G. Kulm (ed.)
(Washington, DC: American Association for the Advancement
of Science, 1990).

2. The research suggests important new possibili-
ties for tests that can diagnose a student's
strengths and weaknesses. Diagnostic tests,
informed by cognitive science research, may
help teachers recognize more quickly the
individual learner's difficulties and intervene
to get the learner back on track. The shift
toward educationally diagnostic tests is an
important one; it represents a move away from
seeing tests as predictive indicators of a fixed
"ability to learn" to tests that can help shape
instruction so "all can learn."21

3. Because research indicates that much learning 
and thinking is active and occurs within a 
specific context, assessment of some skills 
may require testing methods more closely tied 
to the active learning process. Tasks may need 
to resemble what students should be able to do, 
and thus what they spend their time doing in 
the classroom. It is likely that tests that allow 
children to manipulate materials, explore naive 
theories, and demonstrate everyday cognition 
will more accurately reflect their competence 
levels across a range of skills. Instruction and 
assessment can be designed to focus on 
learning in context; as this happens more, 
especially in the new forms of assessment 
commonly referred to as "performance assess-



19 Romberg et al., op. cit., footnote 13, p. 9.
20 Ibid., p. 15.

21 See, e.g., J.W. Pellegrino, "Anatomy of Analogy," Psychology Today, vol. 19, No. 10, October 1985, pp. 48-54.

Box 2-C--Tests as Avenues to Individualized Learning

Cognitive and developmental psychologists tend to look for patterns and similarities in the way people think
and learn. While research has documented some general patterns, it has also found tremendous individual variation
in the rates at which children learn and develop. Other researchers [...]
differences in social, emotional, and motivational characteristics that affect children's learning. Still others have
focused on the modality or "style" by which different children learn. Many have reasoned that if tests can diagnose
learning styles, then they can aid in the development [...]
learning styles. There have been many theories, but no consensus on what those different learning styles look like.
Attempts to match learning styles to styles of instruction were initially popular [...]
has not held up in part because the measures for diagnosing learning styles [...]
expected relationships with achievement.1

Nevertheless, the research suggests that the "ability to learn" (a commonly used definition of "intelligence")
is not a fixed unitary trait: individuals do not have a certain amount of it that predetermines how well and [...]
they can learn. The model of learning disabilities provides a well-accepted example of how one or two areas of
weakness, such as recognition of written words, can interfere [...]
areas. While in the past these children were often seen as unable to learn, or worse yet as "dumb," their capabilities
are now recognized. Many such children need alternative learning methods in order to acquire necessary skills.
Every child brings to any learning situation a [...]
and experience. Diagnostic tests can describe in detail the actual skills of a child in areas related to instruction.
Strengths can then be used by the teacher to support and guide learning in more difficult areas.

One attempt to describe children's skills more broadly is a recent effort to outline "multiple intelligences."
Although theories of the multiple components of intelligence have [...]
work suggests that most of our current approach to education, as well as assessment, has relied heavily on
developing two types of intelligence, which he calls "logical-mathematical" and "linguistic."3 Drawing on
evidence from multiple sources, including neuropsychology and child development, Gardner has proposed an
additional five types of intelligences: musical, spatial, bodily-kinesthetic, interpersonal, and intrapersonal. A
student can be represented by a profile of "intelligences," each of which [...]

Several educational pilot programs have grown out of this theory. One, the Key School in Indianapolis,
attempts to maximize instruction across all [...]
Project Spectrum, has attempted to develop assessment activities [...]
children. The goal of these efforts is to provide a profile of strengths and weaknesses across the seven areas that
can be used to direct educational resources to the child; such a profile could help parents and teachers build on
strengths or bolster areas of weakness during the early years.4

The theory of multiple intelligences provides one model for broadening traditional views about which skills
and competencies are important and require nurturing in the school years. As one policymaker has noted:

Gardner's work has been important in attacking the monolithic [...]
of our thinking. We are beginning to see that [...]
leaders, but to develop the latent talents of the entire population in diverse ways.5



1 D. Carnine, "New Research on the Brain: Implications for Instruction," Phi Delta Kappan, vol. 71, No. 5, January 1990, pp. 372-377;
and Kenneth A. Kavale and Steven R. Forness, "Substance Over Style: Assessing the Efficacy of Modality Testing and Teaching," Exceptional
Children, vol. 54, No. 3, 1987, pp. 228-239.

2 Howard Gardner, Frames of Mind (New York, NY: Basic Books, 1985). For a fuller discussion of the contributions of Spearman,
Guilford, Thurstone, and other researchers whose work was based on different theories of the structure of intelligence, see, e.g., Raymond
Fancher, The Intelligence Men: Makers of the IQ Controversy (New York, NY: W.W. Norton, 1985).

3 The work of Robert Sternberg, another modern pioneer of multiple intelligence, while focused largely on adults rather than school
children, also has important implications for instruction and assessment. See, for example, his book, Beyond IQ: A Triarchic Theory of Human
Intelligence (London, England: Cambridge University Press, 1985).

4 For further description of these programs see Marie Winn, "New Views of Human Intelligence," New York Times Magazine, part 2,
The Good Health Magazine, Apr. 29, 1990, and Howard Gardner and Thomas Hatch, "Multiple Intelligences Go to School," Educational
Researcher, vol. 18, No. 8, November 1989, pp. 4-9.

5 Rexford Brown, Director of Communications for the Education Commission of the States, quoted in Winn, op. cit., footnote 4, p. 30.

ment" (see ch. 7), the lines between assess-
ment and instruction blur. Assessment be-
comes feedback to the learner, which in turn
promotes further learning and growth.

There are many more specific ways in which the 
findings from cognitive psychology could find their 
way into test design, but few areas of cognitive 
research are ready for immediate translation into 
new achievement tests. Thus, any test designed 
using new cognitive findings is likely to require 
considerable research and development before the 
thinking skills that underlie the test can be measured 
with confidence. 

The emergence of new theories of cognition and 
new instructional strategies raises a fundamental 
question about the nature of the relationship between 
curriculum and assessment. Those who advocate 
reforming tests to more closely parallel new theories 
of learning tend to believe that tests should follow
curriculum and instruction. In this regard, they echo
principles of educational test design well established
in the literature of educational measurement.22 The
first step to improving education, according to this
view, is to establish what it is students are supposed
to learn and how they are most likely to learn it; the
next step is to develop instructional approaches; and 
the last step is to develop assessment instruments 
that appropriately measure this content and track the 
learning process. 



Tests as Tools of Educational Reform 

Everyone would agree that there is bound to be 
some back-and-forth motion in this process: decid- 
ing how children are most likely to learn something 
can be informed by assessments of their learning in 
progress. However, another camp of test reformers 
models the relationship explicitly as one in which 
tests drive instruction. Since teachers will teach and 
students will study what is tested, they argue for the 
development of tests covering content children 
should learn; curriculum and instruction will then
fall into place. This section demonstrates how this 
view helped spur the rise of high-stakes tests as 
instruments of policy reform. 




Photo credit: American Guidance Service, Inc.

Many educators urge that tasks on tests resemble the skills
students should acquire in school. In mathematics, for
example, tests like the one pictured above allow children to
manipulate materials or use tools such as calculators.

Educational testing has long been viewed as a 
means to enforce accountability, inform education 
policy, evaluate educational progress, and reform 
the structure and content of teaching and learning.23
Beginning in the mid-1960s and continuing through 
the 1970s and 1980s, the reliance on tests toward all 
these ends began to increase at all levels of 
government, but especially for accountability pur- 
poses and most frequently at the State level. 

As accountability became a major force in educa- 
tion policy, the response most often took the form of 
rising demand for standardized achievement testing. 
Although many States and the Federal Government 
continued to collect other school performance data 
(such as dropout rates and various economic indica- 
tors), testing was the vehicle of choice. At the 
Federal level, policymakers wrote requirements for 
objective evaluations (usually interpreted as stand- 
ardized tests) into programs of aid to elementary and 
secondary schools. At the State level, legislatures in 
25 States enacted statewide minimum competency
tests that affected critical decisions, such as grade

22 [...] Measurement and Evaluation in Education and Psychology, 3rd ed. (New York, NY: CBS

23 See ch. 4 for a fuller discussion of the history of educational testing in the United States.

promotion or high school graduation. And at the
local level, school boards and school administrators 
began to look at tests as a tool for satisfying public 
demands for accountability, providing information 
about how their students compared to others, and 
gauging their schools' progress toward local goals. 

To the chagrin of many school people, Federal,
State, and local district demands for test-based
accountability data often addressed different issues,
with each level of government acting as if data
collected for the other levels was off the mark or
untrustworthy and making little effort to coordinate
the multiple testing requirements. It was hardly an
accident that policymakers embraced standardized
tests as a means to enforce accountability; this was
a tradition with roots in the earliest days of the public
school movement (as described in greater detail in 
ch. 4). 

One of the appealing aspects of tests is that they
enable outsiders (parents, legislators, and the general
public) to leverage the internal workings of
schools. One commentator has likened tests to
"remote control" devices, affording policymakers a
sense of control over classrooms from a safe
distance.24 Another appealing feature is that testing
conforms to a logic that sounds right: if the stakes are
high enough, then teachers and students will change
their behaviors in ways that improve test scores,
leading to increased learning. The facts that tests
may not be designed to serve this purpose, and that
higher test scores do not necessarily mean increased
achievement, are often overlooked. Finally, test
scores serve a powerful symbolic function. A steep
trend line on a graph can be strong ammunition in
political struggles over the quality of schools.
Whether the data are reliable and meaningful,
though, are issues that are often relegated to the fine
print once the headlines have left their marks.

A Climate Ripe for Growth

The reliance on tests as policy tools and the rapid
adoption of high-stakes testing programs were not
the result of a carefully coordinated national strategy
to improve schooling. Rather, they reflected the
convergence of several demographic, social, and
economic trends that began in the 1960s.

Demographic Trends 

The Baby Boomer cohort was a bulge in the 
demographic python. And as it moved through the
K-12 system in the mid-1960s and early 1970s, it
created unprecedented demands on school manage-
ment, particularly in urban and suburban school
systems, the centers of growth. As in earlier periods
of demographic change, expansion of the school
population led to heightened demand for additional
sources of information about student achievement,
over and above the judgments of teachers and
administrators. Moreover, as access to education
expanded for minority, immigrant, and low-income
children, and in the late 1970s for children with
disabilities, schools came under increased pressure
to meet the needs of a more diversified student
population. Fairness in the allocation of educational
opportunities, always a cornerstone of the American
public school ethos, rose once again to the top of the
education policy agenda.

To confront these demographic changes in an
efficient way, schools acted in the 1960s and 1970s
in ways that mirrored their reactions to change in
decades past: they looked to the world of business,
and attempted to adapt techniques such as consolida-
tion, standardization, classification, and, some might
argue, bureaucratization. Small districts and rural
districts that had lost population to urban and
suburban areas consolidated; between school years
1963-64 and 1973-74, the number of public school
districts in the United States decreased almost by
one-half, from over 31,000 to less than 16,000.
Moreover, school systems of all types began relying
more on tests to obtain information on larger student
bodies in an efficient and objective manner, as well
as to make decisions about sorting and tracking
students within these bigger organizational struc-
tures.

Social Trends 

The civil rights movement had a significant effect 
on American education in general and on testing 
policy in particular. In addition to raising issues

Alfred Chandler, The Visible Hand: The Managerial Revolution in American Business (Cambridge, MA: Harvard University Press,

about student classification and disaggregation of
achievement data, the civil rights movement called 
attention to the vast disparities that existed in the 
quantity and quality of education available to 
children from different racial and ethnic back- 
grounds. It also helped fuel a broader discussion of 
the educational inequities experienced by poor and 
disadvantaged children of all backgrounds, includ- 
ing rural white children, migrant children, and 
limited-English-proficient children.

Passage of the 1964 Civil Rights Act decisively 
settled the congressional battles over desegregation 
that had hampered past school aid bills, and paved 
the way for a significant Federal role in education. 
On the heels of the Civil Rights Act, Congress 
passed a host of social legislation — programs for 
education, welfare, health, labor, housing, and 
nutrition — all aimed at improving the lot of the 
economically disadvantaged. With those programs 
came a renewed interest in survey research and in the 
development of outcome-based measures to justify 
the money being spent.25

Economic Trends: Concerns About 
Competitiveness 

The Nation's reaction to the Sputnik launch in 
1957 foreshadowed the way that school systems 
would respond in subsequent decades to perceived 
threats to America's international competitiveness. 
Looking for ways to explain second-rate technologi- 
cal performance, leaders and the public seized on the 
apparently uninspired performance of American 
students in mathematics and science as a key reason 
why the United States was losing the space race. 
Consensus began to emerge that schools needed to 
place more emphasis on these two subjects. Con- 
gress passed the National Defense Education Act, 
the first substantial influx of Federal aid to ele-
mentary and secondary education, targeted at mathe-
matics and science, and also containing a notable
provision authorizing funds for guidance counseling
and testing to identify high-ability students.

Variations on this pattern of concerns about 
student achievement igniting public debate and 
propelling a nationwide response were to be re- 
peated in later decades. For example, when A Nation
at Risk linked falling Scholastic Aptitude Test
(SAT) scores with eroding economic competitive-
ness, it was the States that responded aggressively
by adopting more rigorous graduation requirements,
initiating a range of other reforms, and, in some
cases, providing significant additional funding for
schools (developments that led to more standardized
testing, as will be noted later).26

Another trend related to economics merits men- 
tion. In the 1970s, educational researchers began 
applying some of the principles and vocabulary of 
economics to education, assessing the efficiency and 
cost-effectiveness of education in terms of inputs 
and outputs. Most of these studies measured outputs 
in terms of standardized achievement test scores, 
some in conjunction with other quantitative meas-
ures.27 This trend in the academic research mirrored
the shift occurring in the broader policy community.
It was during this period that Congress amended
several Federal programs (including the Job Train-
ing Partnership Act and the Vocational Education
Act) to emphasize outcome measures or perform-
ance standards in program evaluation.28

Changes in School Finance: Growth in Federal 
and State Support 

The debut of the Federal Government as a 
significant partner in education during the 1960s, 
and the surge in State reform initiatives during the 
1970s and 1980s, transformed the dynamics of 
school finance. In school year 1959-60, the lion's
share of revenues supporting public elementary and 



25 "Some proponents of social legislation resisted any accountability, believing that such could not be measured when including the social goals of
the programs." Donald Senese, former assistant secretary for Educational Research and Improvement, personal communication, August 1991.

26 A Nation at Risk is among the most cited government reports on education in the past 50 years, and arguably one of the most influential in spurring
a range of school improvement efforts. It is important to note, however, that the findings in that report did not go entirely unchallenged. See, e.g., L.
Stedman and Marshall Smith, "Recent Reform Proposals for American Education," Contemporary Education Review, vol. 2, fall 1983, pp. 85-104.

27 For a recent review of this literature see Eric A. Hanushek, "The Economics of Schooling: Production and Efficiency in Public Schools," Journal
of Economic Literature, vol. 24, 1986, pp. 1141-1177; Richard Murnane, "Interpreting the Evidence on 'Does Money Matter?'" Harvard Journal on
Legislation, vol. 28, No. 2, summer 1991, pp. 457-464; and Henry M. Levin, "Mapping the Economics of Education: An Introductory Essay,"
Educational Researcher, vol. 18, No. 4, May 1989, pp. 13-16. It is important to note that many of the economists working in this field recognized the
limitations of achievement test scores as outcome measures, but the scores did offer a relatively neat quantitative approach to estimating the input-output
models of interest.

28 See, e.g., U.S. Congress, Office of Technology Assessment, "Performance Standards for Secondary School Vocational Education," background
paper of the Science, Education, and Transportation Program, March 1989, for discussion of the shift to outcome-based measures of public programs.

secondary education (almost 57 percent) came
from local sources; States provided 39 percent and
the Federal Government a mere 4 percent. As shown
in table 2-1, by 1969-70, a few years after the Federal
Elementary and Secondary Education Act had begun
channeling over $1 billion annually to schools, the
Federal share had risen to 8 percent, with States
holding their own, and local support declining. A
decade later, States had become the primary source
of educational revenues, with a share approaching
47 percent. In recent years, the State share has
continued to move up as the Federal share has
declined, so that States now provide about one-half
the funding for education.
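
The shift described above can be checked with simple arithmetic. The sketch below is illustrative only: it takes the Federal and local percentages as read from table 2-1 and, on the assumption that Federal, State, and local sources account for all revenue, derives the State share as the remainder. The names are hypothetical, and a derived figure can differ from a published one by a tenth of a point because of rounding.

```python
# Federal and local revenue shares (percent) as read from table 2-1.
# Assumption: the three sources are exhaustive, so State = 100 - Federal - Local.
YEARS = ["1959-60", "1969-70", "1979-80", "1987-88"]
FEDERAL = [4.4, 8.0, 9.8, 6.3]
LOCAL = [56.5, 52.1, 43.4, 44.1]


def state_share(federal, local):
    """State share implied by the other two shares, rounded to one decimal."""
    return round(100.0 - federal - local, 1)


def implied_state_shares(federal_shares, local_shares):
    """Implied State share for each school year, in the order given."""
    return [state_share(f, l) for f, l in zip(federal_shares, local_shares)]


if __name__ == "__main__":
    states = implied_state_shares(FEDERAL, LOCAL)
    for year, fed, st, loc in zip(YEARS, FEDERAL, states, LOCAL):
        print(f"{year}: Federal {fed:.1f}  State {st:.1f}  Local {loc:.1f}")
```

The remainders track the narrative: a State share of about 39 percent in the first two school years, close to 47 percent a decade later, and about one-half by 1987-88.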

The increase in Federal and State support brought 
about some important changes in school finance: it
helped reduce revenue disparities between school
districts, which formerly had depended on local 
property tax receipts for over one-half their income; 
and it targeted additional resources to students, 
subject areas, or urgent problems deemed to warrant 
Federal or State attention. But with new money came 
new overseers and greater demands for measurable 
results. A principal source of Federal accountability 
requirements was "compensatory education," a
program created in 1965 by Title I of the Elementary 
and Secondary Education Act. Renamed Chapter 1 
in 1981, this program has been the cornerstone of 
Federal aid to elementary and secondary schools. 
From the beginning, legal requirements to evaluate 
the effectiveness of this program in meeting the 
educational needs of educationally disadvantaged 
children have resulted in increased reliance on 
standardized norm-referenced tests. As discussed in 
depth in chapter 3, the Federal Government has had 
a powerful impact on U.S. testing practice because 
of the evaluation and reporting requirements of 
Chapter 1 legislation. 

Developments in the Testing Industry 

Economic trends influenced assessment in yet 
another significant way. Advances in testing tech- 
nology and psychometric research, accompanied by 
expansion of the testing industry, made wide-scale 
testing more affordable for school districts and more 
profitable for testing companies than ever before. 
While technological, research, and corporate devel- 



Table 2-1--Sources of Revenues for Public Elementary
and Secondary Schools (in percent)

            1959-60   1969-70   1979-80   1987-88

Federal       4.4%      8.0%      9.8%      6.3%
State        39.1      39.9      46.8      49.5
Local        56.5      52.1      43.4      44.1

SOURCE: U.S. Department of Education, National Center for Education
Statistics, Digest of Education Statistics, 1990 (Washington,
DC: 1991), p. 147.



opments alone did not create the demand for 
testing (that demand existed well before the advent
of specific scoring or testing technologies), they
provided powerful efficiency arguments in favor of 
standardized, machine-scorable tests. 

But at the same time as machine-scorable testing 
was gaining ground as the vehicle of choice to 
manage the assessment demands of the period, 
curriculum experts and educational psychologists 
were busy crafting revised theories of human cogni- 
tion and learning (as discussed above). Indeed, they,
too, were strongly influenced by the apparent
decline in American students' performance,
compared to students in other nations, and by the
fear of America's irreversible loss of international 
competitiveness. Their response, though, was to 
rethink thinking, and among the results emerging 
from this evolving line of research are prescriptions 
for radical changes in the technologies and uses of 
educational assessment. 



The Net Result 

Taken together, these demographic, economic, 
and social factors created a climate in which the use 
of tests as policy tools could take root and thrive. As 
summarized in a seminal National Academy of
Sciences report: 

The most significant development in management 
(and testing) in recent years has been the increasing 
demand for central oversight of educational results. 
This comes partly because of the increased reliance 
of local schools on State funds since the late 1960s,
partly because education has come to be viewed 
explicitly as a weapon with which to combat poverty 
and increased equality, and partly because of a 
suspicion that teachers and local administrators are 
falling down on the job.29



29 Alexandra K. Wigdor and Wendell R. Garner (eds.), Ability Testing: Uses, Consequences, and Controversies, part 1, report of the committee
(Washington, DC: National Academy Press, 1982), p. 170.

States, Tests, and Minimum
Competency

Although the Federal Government has wrought 
changes in education of indisputable importance, the 
main arena for the events commonly thought of as
the school reform movement has been the States.
Education reform can mean many things and can be
conducted in quite different ways. In general, the
term connotes efforts to improve the quality of
educational outcomes through changes in one or
more aspects of the school system. Some reforms,
such as the decentralization of decisionmaking that
took place in the New York City schools in the late
1960s,30 address the actual organization of school-
ing. Others focus on curriculum, teacher or adminis-
trator salary structures, or student tracking and
grouping policies.31

Spurred by public demands for more accountabil- 
ity in education. States have taken on new and 
increasingly activist roles in education, and in
education reform, over the past 15 years. In gen-
eral, State-initiated reforms of the 1970s were "top
down" in nature: States identified their priorities,
often in the forums of the legislature and State Board
of Education, and set standards for all local school
systems. 

Tests have been essential components of most
State-mandated reforms and have been asked to
fulfill many new functions, such as determining the
allocation of resources or persuading individuals
and organizations to change behavior. In fact, States
have been the main practitioners of high-stakes
educational testing. For these reasons, the State
experience with mandated reforms is a good illustra- 
tion of some of the effects of externally developed 
standards on educational practices. 

Minimum Competency Testing: Definition 

Perhaps the most significant manifestation of the 
vigor with which States approached reform was the
growth of minimum competency testing (MCT) that
occurred during the late 1970s and continued into
the 1980s. MCT refers to programs mandated by 
State or local agencies that have the following 
characteristics: 

• All or almost all students in designated grades
take paper-and-pencil tests designed to meas-
ure a set of skills deemed essential for future
life and work.

• The State or locality has established a passing
score or acceptable standard of performance on
these tests.

• The State or locality may use test results to: a)
make decisions about grade-level promotion,
high school graduation, or the awarding of
diplomas; b) classify students for remedial or
other special services; c) allocate certain funds
to school districts; or d) evaluate or certify
school districts, schools, or teachers.32

Within this general framework, minimum compe- 
tency tests can vary greatly in their design, format, 
uses, and applications to high-stakes decisions. 

Impetus for MCT 

MCT is a genuine example of a grassroots
phenomenon, with the impetus coming mostly from
outside the educational system.33 Fueled first by
popular writers, employers, and the media, and later
by a proliferation of education reform panels, a
movement began to catch fire among parents and
other citizens who were already somewhat disillu-
sioned with the schools. In the minds of this group,
the symptoms of educational distress were all
around, apparent to anyone who dared open his eyes:
standards had been relaxed to the point that a high
school diploma no longer meant anything; students
were leaving school without the basic reading and
mathematics skills they needed to succeed in work
or higher education; pupils were being promoted to
higher grades automatically, regardless of achieve-
ment; too little time was being spent on instruction
and too much on "frills"; and too many teachers



30 See, e.g., Diane Ravitch, The Great School Wars: New York City, 1805-1973 (New York, NY: Basic Books, 1974), especially pp. 251-404.

31 For a review of recent school reform efforts, see, e.g., Educational Testing Service, The Education Reform Decade, policy information report
(Princeton, NJ: 1990). For analysis of the role of testing in the reform movements of the 1970s and 1980s, see Douglas A. Archbald, University of
Delaware, and Andrew C. Porter, University of Wisconsin-Madison, "A Retrospective and an Analysis of the Roles of Mandated Testing in Education
Reform," OTA contractor report, January 1990.

32 Ronald A. Berk, "Minimum Competency Testing: Status and Potential," The Future of Testing, Barbara S. Plake and Joseph C. Witt (eds.)
(Hillsdale, NJ: L. Erlbaum Associates, 1986), pp. 88-144.

33 Archbald and Porter, op. cit., footnote 31. See also Barbara Lerner, "Good News About American Education," Commentary, vol. 91, No. 3, March

were incompetent.34 A symbol that became inextri-
cably linked with deteriorating educational quality
and perhaps more responsible than any other for
erosion in public confidence was the steady drop in
SAT scores that began in 1963 and persisted through
the 1970s.35

This movement, which led to the adoption of
MCT by many States, was an outgrowth of the
"back to basics" movement of the 1970s, itself a
backlash against the educational experimentation
and general social permissiveness that had charac-
terized the previous decade. A public grown suspi-
cious of such innovations as schools without walls
and student-centered learning, or the elimination of
dress codes and the expansion of electives, came to
believe that major changes (more rigorous stand-
ards, a curriculum rooted in the "three Rs") were
needed. But many people believed that since local
teachers and administrators were part of the prob-
lem, they could not be relied on to make the needed
reforms without outside pressure. Seeking support
from the Federal Government was an unappealing
alternative to those who feared an infringement on
State and local control of education or the enactment
of Federal mandates.

Eventually public pressure focused on the States 
as the level of government best positioned to direct
education reform. State Government was close
enough to grassroots to understand community
standards and needs, but possessed enough authority
to put pressure on recalcitrant school districts. It was
largely elected State officials (State legislators and
State Board of Education members) who found
themselves at the center of the debate over education
reform. It is significant that elected officials, more
than professional educators, took the lead on MCT.
Many State legislators were already sympathetic
with the back to basics movement and were willing, 
even anxious, to show their support through spon- 
soring legislation. In addition, the fact that State 



legislators were not part of the educational establish- 
ment may explain their faith in the power of tests to 
bring about major change in education. Finally, as 
some researchers have observed: **As non-educa- 
tors, enthusiasts of competency testing [were] free to 
focus on the results and to pay little heed to the 
processes by which they might be achieved. ' State 
legislators may have viewed this freedom as a plus; 
by enacting MCT they could appear to be doing 
something significant about education reform with- 
out seeming to encroach too much on local control 
or venture into instructional areas they knew little 
about. 

The basic idea behind MCT was an appealing one
to many State policymakers. In developing the tests,
States could create some uniform, external standards
that emphasized those skills deemed especially
important to literacy and life success. By further
tying these standards to promotion, graduation, or
other educational way stations, States could focus
instruction and learning on critical areas.37

The Rise of MCT 

By the mid-1970s, the climate was ripe for action 
in many States. States had already begun to pick up 
a greater share of the costs of education, and the 
principle that he who pays the piper calls the tune is 
a time-honored one in the educational arena. And in 
many States, the use of tests as accountability tools 
was a well-established principle (witness the exis- 
tence of State licensing examinations in a range of 
professional fields, or the State Regents' examina- 
tions in New York). In addition, early MCT pro- 
grams in Denver, Florida, and Georgia had set a 
precedent and piqued the interest of policymakers 
from other States. 

The major expansion in MCT that occurred during 
the 1970s and 1980s was a watershed event in testing 
policy. Prior to 1975, only a few States mandated 
MCT. The peak growth period for statewide compe- 



34 Berk, op. cit., footnote 32.

35 George Madaus, "Testing and Policy: True Love, Shot Gun Wedding or Marriage of Convenience?" paper presented at the annual meeting of the
National Council on Measurement in Education, New Orleans, LA, April 1984. The sudden (and short-lived) upturn in Scholastic Aptitude Test scores
beginning in 1979 is evidence for some analysts of the effectiveness of the minimum competency testing movement. See Lerner, op. cit., footnote 33,
for the most ardent formulation of this causal argument.

36 Walt Haney and George Madaus, "Making Sense of the Competency Testing Movement," Harvard Educational Review, vol. 48, No. 4, November
1978.

37 Critics took a much dimmer view of what they saw as the real function of minimum competency testing: "When penalties associated with failing
a certification test are severe enough, instruction and study will adjust to prepare pupils to pass it. The test becomes a coercive device to influence both
the curriculum and instruction. Unleashing the fear of diploma denial or retention in grade bullies the instructional delivery system into line." P. Airasian
and G. Madaus, "Linking Testing and Instruction: Policy Issues," Journal of Educational Measurement, vol. 20, No. 2, summer 1983.
tency testing was between 1975 and 1979 (see figure
2-3). In fact, MCT accounted for most of the overall
growth in educational testing in the post-1975 era.
By 1980, 29 States had implemented legislation that
required students to pass criterion-referenced
examinations and 8 more had such legislation pending.38
Some States used the examinations to determine
eligibility for remedial programs and promotions
and some required them for graduation. By 1985,
growth in such programs had leveled off, although
33 States were still mandating statewide minimum
competency testing; 11 of these States required the
test as a prerequisite for graduation.39

Minimum competency tests were altogether
different creatures from the "off-the-shelf" norm-referenced
achievement tests that had dominated
standardized testing up to that point. Most MCT 
instruments were custom-made in State education 
offices or by vendors working from State specifica- 
tions, and unlike commercial tests, were designed 
from the start as high-stakes instruments. Most 
States required students to achieve a predetermined 
passing score for grade promotion or diploma 
receipt; usually students were allowed to take the 
test over if they did not obtain a passing score the 
first time. Some States mandated remediation for 
students who did not pass, while in other States it 
was optional. 

Minimum competency tests are criterion-refer- 
enced; they measure performance in relation to 
specified skills objectives in such areas as
vocabulary, reading comprehension, mathematical
computation, and, in some cases, functional skills (filling
out a job application, for instance, or conducting
simple financial transactions). The multiple-choice
format is by far the most common, although some 
competency tests use other approaches, such as 
essay writing, oral examinations, and problem 
solving. 

Two other features distinguish MCTs from other
types of tests. First, because they use specific
passing scores, they require some type of
standard-setting process to determine and justify the "cutoff
score."40 Since there is no fixed, scientific approach
to determining what knowledge a person needs to
"function" in society, this can be a murky process.
Second, MCT instruments are always administered
on a census basis: each student takes the test. This
does not mean, however, that the tests are not also
used as instruments of school-level accountability.
Many States and districts aggregate individual
student scores to derive passing rates or average
scores for entire schools. The demand for this type
of comparative information has actually increased,
with business leaders and policymakers often
linking support for expensive reform packages to the
willingness of State Education Agencies (SEAs) and
school districts to accept public disclosure of test
results. (Nineteen States now produce public reports
comparing districts or schools on State test
results.41)

Figure 2-3. Number of States Conducting Minimum
Competency Tests

[Figure not reproduced. Vertical axis: Number of States.]

SOURCE: U.S. Congress, Office of Technology Assessment, "State
Educational Testing Practices," background paper of the Science,
Education, and Transportation Program, December 1987;
supplemented by data from Ronald A. Berk, "Minimum Competency
Testing: Status and Potential," The Future of Testing,
Barbara S. Plake and Joseph C. Witt (eds.) (Hillsdale, NJ: L.
Erlbaum Associates, 1986), p. 96.

The Second Wave of State-Mandated Reform 

A Nation at Risk and other reform reports of the
1980s set in motion a second wave of State- 



38 Berk, op. cit., footnote 32.

39 U.S. Congress, Office of Technology Assessment, "State Educational Testing Practices," background paper of the Science, Education, and
Transportation Program, December 1987.

40 See ch. 6; Robert Linn, George Madaus, and Joseph Pedulla, "Minimum Competency Testing: Cautions on the State of the Art," American Journal
of Education, November 1982, pp. 1-35; and Richard Jaeger, "An Iterative Structured Judgment Process for Establishing Standards on Competency
Tests: Theory and Application," Educational Evaluation and Policy Analysis, vol. 4, No. 4, winter 1982, pp. 461-475.

41 Archbald and Porter, op. cit., footnote 31.
Photo credit: Anmu Sonnlor

The 1970s witnessed increased public demands for
accountability in education. States initiated reforms that
included mandated tests, as well as course and graduation
requirements. Students in this ninth grade algebra class
are required to take the course by Louisiana State law.

mandated reform. Reacting to criticisms that not 
enough students were taking advanced courses in
science, mathematics, foreign languages, and other
areas deemed critical to American international
competitiveness, States assumed greater control of
graduation requirements, making them more
rigorous.42 In addition, States pushed for and obtained
more authority over curriculum, usually making
it more prescriptive and enforcing a greater
degree of consistency across the State.43 Many
States with statewide (rather than locally
determined) textbook adoption policies also began
scrutinizing more closely the match between their
textbooks and their curriculum guidelines.44

Under public pressure to demonstrate gains in test 
scores, some States also undertook major "curriculum
alignment" efforts, which linked curricular
objectives, textbooks, lessons, instructional
methods, and assessment. Curriculum alignment is a
common strategy at the classroom and school level, 
but it is only recently that entire districts and States 
have experimented with it. The idea behind curricu- 
lum alignment is straightforward: if the goal is to 
improve test scores, then instruction should focus on 
what is tested. At the State level, however, alignment 
is not always easy to achieve. SEAs must contend 
with traditions of local curriculum autonomy and 
wide differences among school districts according to 
a whole range of characteristics. Moreover, the local 
variables that affect course content and classroom 
instructional practice are not easily influenced by 
State policies. 

Nonetheless, many States have gradually tight- 
ened control over those curriculum variables that 
they can influence. Districts under pressure to raise
test scores on State tests have done the same.45 In
practice, curriculum alignment can range from State
officials selecting a norm-referenced test based on
how well it matches with loosely defined State
education goals, to States conducting exhaustive
content analyses to ensure detailed matches among
tests, curriculum, and textbook objectives.
Off-the-shelf standardized tests (the staple of State testing
for decades) increasingly were augmented or re-



42 William Clune, Paula White, and Janice Patterson, The Implementation and Effects of High School Graduation Requirements: First Steps Toward
Curricular Reform (New Brunswick, NJ: Rutgers, The State University of New Jersey, Center for Policy Research in Education, 1989).

43 In a survey of 27 State social studies specialists, 26 said course requirements and guidelines had become more specific in the last 4 to 5 years. The
investigators concluded: "Despite great differences among the states, a very strong generalization emerges from the study, namely, that the current
'flavor' of social studies throughout most of the country is highly prescriptive. Many prescripts have been applied in recent years to students, teachers,
and curricula." Council of State Social Studies Specialists, Social Studies Education, Kindergarten-Grade 12 (Washington, DC: National Council for
the Social Studies, 1986).

44 Harriet Tyson-Bernstein, A Conspiracy of Good Intentions (Washington, DC: Council for Basic Education, 1988); and Harriet Tyson-Bernstein,
"Three Portraits: Textbook Adoption Policy Changes in North Carolina, Texas and California," occasional paper for the Institute for Educational
Leadership, 1989.

45 Ken Komoski, director of the Educational Products Information Exchange, as cited by Lynn Olson, "Districts Turn to Nonprofit Group for Help
in 'Realigning' Curricula to Parallel Tests," Education Week, vol. 7, No. 8, Oct. 28, 1987, pp. 17, 19. Textbook manufacturers market their books in
"big-market" States and districts by demonstrating (in documentation and in sections of the books themselves) the alignment of their textbook content
with State curriculum frameworks through "correlational analyses."
Table 2-2. Improvements in Student Achievement Associated
With Curriculum Alignment

Locale           Subject               Grade         Period     Gain(a) (in percent)

Alabama          3Rs                   3, 6, 9       1981-86    1-13
                 3Rs                   11            1983-85    4-8
Connecticut      3Rs                   9             1980-84    6-16
Detroit          3Rs                   12            1981-86    19
Maryland         3Rs                   9             1980-86    13-25
                 Social studies        9             1983-86    23
New Jersey       Reading/mathematics   9             1977-85    16-19
                 Reading/mathematics   10            1982-85    8-11
South Carolina   Readiness             1             1979-85    14
                 Reading/mathematics   1-3, 6, 8     1981-86    12-20

(a) Figures represent the increased percentage of students who have mastered standards of quality during the period in
question.

SOURCE: W. James Popham, "The Merits of Measurement-Driven Instruction," Phi Delta Kappan, vol. 68, No. 9, May
1987, pp. 679-682. Note, numbers in right-most column denote the range of percentage increases across
the different grade levels and tests in columns on left.



placed by custom-developed tests designed to assess
State curriculum guidelines and goals. 

MCT: Lessons for High-Stakes Testing

One problem with drawing conclusions about the 
effects or influences of State-mandated tests on 
school improvement is that testing is but one of 
many forces that shape the learning experiences of 
young people. Indeed, mandated testing is as much
a result of widely held beliefs about curriculum,
teaching, and learning as it is a cause of educational
outcomes. 

Even so, researchers have made some thorough 
analyses of State experiences with MCT and other 
State-mandated reforms and drawn some conclu- 
sions about their effects. In general, these research- 
ers have concluded that the movement, which began 
amid such optimism, has produced results that are on 
the whole disappointing. Key findings from
studies of MCT are summarized and analyzed
below.

Test Score Gains 

A number of States and districts can point to gains
over time on minimum competency and other State
tests. Gains tend to be more apparent in districts and
States that have systematically pursued test and
curriculum alignment. For example, on the Texas
Assessment of Basic Skills in mathematics, 70
percent of ninth graders achieved mastery in 1980;
by 1985, the figure had risen to 84 percent. On the
reading portion of the same assessment, passing
rates increased from 70 percent to 78 percent during
the same period.46 Similarly, in South Carolina, the
percentage of first graders passing the basic skills
reading test rose from 70 percent in 1981 to 80
percent in 1984, and for mathematics the passing
rate went from 68 to 81 percent during the same
period47 (see table 2-2).

Impressive as these gains might be, their
credibility was severely undermined by analysts who looked
more closely at the timing and generality of the
trends in test scores.48 Among the findings in this
body of research, the most damning to the MCT
movement were that: 1) scores on some tests in some
places rose more rapidly and more significantly than
in other places, 2) scores rose on tests even in States
without MCT,49 3) scores began to rise before MCT
could have had much impact, and 4) all States were
reporting performance of their students on nationally
normed achievement tests above the national
average, a statistical impossibility (see box 2-D).



46 Office of Technology Assessment, op. cit., footnote 39, p. 272.

47 See W. James Popham, Keith L. Cruse, Stuart Rankin, Paul Sandifer, and Paul L. Williams, "Measurement Driven Instruction: It's on the Road,"
Phi Delta Kappan, vol. 66, 1985, pp. 628-634; cited in Lorrie Shepard and Katharine Dougherty, "Effects of High-Stakes Testing on Instruction," paper
presented at the annual meeting of the American Educational Research Association, Chicago, IL, April 1991.

48 See especially Daniel Koretz, Trends in Educational Achievement (Washington, DC: Congressional Budget Office, April 1986); and Congressional
Budget Office, Educational Achievement: Explanations and Implications of Recent Trends (Washington, DC: August 1987).

49 See also Gerald Bracey, rejoinder to Barbara Lerner, Commentary, vol. 92, No. 2, August 1991, p. 10.
Box 2-D. The Lake Wobegon Effect: All the Children Above Average?

In radio personality Garrison Keillor's fictional town of Lake Wobegon, "all the women are strong, all the men
are good-looking, and all the children are above average." To statisticians, of course, average is simply a
representation of central tendency, and is a point drawn from an array of numbers. In many norm-referenced tests
(NRTs), average represents the "median" and shows that one-half the test takers scored above this point and
one-half below. It is statistically impossible for everyone to be above average, but "above average" is in some
sense an American ideal.

The word average connotes a certain hum-drum, undistinguished level of achievement, especially when 
applied to people. Just as the citizens of mythical Lake Wobegon want all their children to be above average, 
teachers, principals, and parents want to show that their children are doing well.

Thus, the desire for higher test scores may overwhelm the desire to improve actual learning. Similarly, in
reporting scores, calculations and methods may be used that do not give a full or accurate picture. Such excessive
emphasis on test scores can compromise the value of information, as well as give misleading views of how children
and schools "rank" with regard to one another. For example, students and teachers may focus their efforts on
improving performance on samples of what is to be learned, rather than on the body of knowledge from which the
samples are drawn, and rising test scores may then be erroneously interpreted as reflecting genuine gains in
achievement. Schools or districts seeing the scores of their students rise may be lulled into a false sense of
complacency.

Or consider another possible example of how test scores used alone can lead to inaccurate inferences about
achievement gains. A school system adds a number of academic high school course requirements in order to increase
achievement levels. After several years, test scores go up considerably and administrators conclude that increased
course requirements have raised achievement levels throughout the district. However, this gain has been attained
at the expense of a number of low-achieving students dropping out. True achievement has not risen; but the lowest
scoring students are no longer represented in the data. In this case, achievement test scores examined in combination
with another achievement indicator (drop-out statistics) might have demonstrated that the gains were artificial.
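
The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ten scores and the helper name mean_after_dropout are hypothetical and are not drawn from any district's data. No individual score changes, yet the average of those still tested rises once the two lowest scorers leave.

```python
from statistics import mean

def mean_after_dropout(scores, n_dropped):
    """Mean of the scores that remain after the n_dropped lowest scorers leave."""
    remaining = sorted(scores)[n_dropped:]
    return mean(remaining)

# Hypothetical cohort of ten students (illustrative numbers only).
cohort = [35, 40, 45, 50, 55, 60, 65, 70, 75, 80]

before = mean(cohort)                   # everyone is tested
after = mean_after_dropout(cohort, 2)   # the two lowest scorers have dropped out
dropout_rate = 2 / len(cohort)          # the second indicator that exposes the artifact

print(before, after)   # 57.5 62.5
print(dropout_rate)    # 0.2
```

Reading the 5-point rise alongside the 20 percent drop-out rate is what shows the gain to be an artifact of who was tested rather than of what was learned.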

The so-called Lake Wobegon phenomenon is by now a familiar example of how excessive focus on test scores
can provide misleading information. Issued in 1987 by a group called the Friends for Education, the Lake Wobegon
report asserted that all States reporting statewide test scores ranked above the national average; however, many of
these same States were doing very poorly on other indicators such as graduation and literacy rates.1

The Lake Wobegon report sparked controversy and debate; critics charged that the report contained many
inaccuracies and misunderstandings of the technical nature of test scores. Although subsequent analyses by testing
experts have acknowledged that such errors do exist in the report, they have largely confirmed the basic conclusion
of the Lake Wobegon report: achievement test scores can give a highly exaggerated picture of achievement.2
Although the causes of the problem are complex and it is difficult to collect data about them, some of the most
well-understood contributions to the Lake Wobegon phenomenon are described below.

Dated norms. Before a standardized NRT is released, it is administered to a national sample of students to
obtain "norms," that is, the distribution of scores for children across the Nation. That set of norms, which acts
as a national standard, will then be used for about 7 years before a new form of the test is developed and
"re-normed" on a new sample of children. When there are upward trends in genuine achievement, old norms
become easier to master because children know more than those in prior years.3 When old norms are used, the
average performance of students today is being compared with students who took the test up to 7 years ago. Thus,
today's children will appear above average.



1 J. Cannell, Nationally Normed Elementary Achievement Testing in America's Public Schools: How All 50 States Are Above the National
Average (Daniels, WV: Friends for Education, 1987).

2 See Daniel Koretz, "Arriving in Lake Wobegon: Are Standardized Tests Exaggerating Achievement and Distorting Instruction?"
American Educator, vol. 12, No. 2, summer 1988, pp. 8-15, 46-52; Robert L. Linn, Elizabeth Graue, and Nancy M. Sanders, "Comparing State
and District Test Results to National Norms: Interpretations of Scoring 'Above the National Average,'" paper presented at the annual meeting
of the American Educational Research Association, San Francisco, CA, March 1989.

3 See, e.g., Linn et al., op. cit., footnote 2.
Repeated use of nonsecure tests. Because the same tests are present in the district and given over a period
of years, teachers and students become increasingly familiar with the test questions. This is one of the factors that
can contribute to a very focused "teaching to the test" and leads to the difficulty in defining the gray area between
legitimate test preparation activities and outright cheating (e.g., by having students practice actual test questions).
The demarcation between legitimate test preparation activities (e.g., giving practice, coaching, and explanation of
instructions to students) and dubious or even unethical practices may vary from school system to school system.4
Even if all test preparation activities are legitimate and teaching to the test is minimized, however, some gains can
probably be attributed to the increased familiarity with a particular form of a test that comes with use of a single
test over a number of years.

Selection of closely aligned tests. Standardized achievement tests vary in content, emphasis, and form.
Administrators typically select tests that most closely match the curricular objectives of their State or district.
Students will tend to score higher on a test that is closely aligned with their own curricula than will students who
have been taught a different, less closely aligned curriculum. Because the norming group of any test is composed of
schools which vary in their degree of alignment, a district with a highly aligned curriculum will score higher than
the norming group. Thus, administrators who select a highly aligned test, or have a customized test made for them,
will often find their students scoring better than the national norming group "... even if their level of achievement
is in some broader sense equivalent, simply because their curricula match the test more closely and thus prepare
them better for it."5

Selection of students to be tested. Testing manuals usually explain that certain students, such as non-English
speakers or special education students, have been excluded from the norming sample. However, when the tests are
being administered in schools, specific decisions about which children to exclude (who has mastered English well
enough to take the test, for example) have to be made at the district and school level. Because many of the students
who will be excluded (including truant or chronically absent children) will score well below average, these decisions
can have a major impact on a school or district's average score. Schools that decide to exclude all such students are
likely to have a higher average than schools with policies that attempt to include all students for whom the test can
be considered valid. If the exclusionary policies for a district are more liberal than those used to obtain the norming
sample, that district is likely to appear "above average."
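
The dated-norms mechanism described above can be illustrated with a small Python sketch. The norming samples, the assumed 3-point genuine gain, and the function name percentile_rank are hypothetical choices made for illustration; they do not represent any publisher's norming procedure. The sketch uses the common mid-rank definition of percentile rank (percent scoring below, plus half the percent scoring the same).

```python
def percentile_rank(score, norm_scores):
    """Mid-rank percentile of `score` within a norming sample."""
    below = sum(1 for s in norm_scores if s < score)
    equal = sum(1 for s in norm_scores if s == score)
    return 100 * (below + 0.5 * equal) / len(norm_scores)

# Hypothetical norming sample collected several years ago: scores 40..60, median 50.
old_norms = list(range(40, 61))

# Assumption: genuine achievement has since risen by 3 points across the board,
# so a fresh norming sample would look like this.
new_norms = [s + 3 for s in old_norms]

todays_median = 53  # the typical student today

print(round(percentile_rank(todays_median, old_norms), 1))  # 64.3
print(round(percentile_rank(todays_median, new_norms), 1))  # 50.0
```

Under these assumptions the same typical student sits at the 50th percentile against current norms but near the 64th against the old ones, so a district of typical students would report itself "above average" without being so.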

Although embarrassing to some State policymakers, the Lake Wobegon report illustrated the potential mischief
caused by high-stakes testing: higher test scores without more learning. And since the publication of the original
study, other researchers have replicated the basic result. For example, one recent longitudinal study of a large urban
district that uses a high-stakes commercial achievement test found that the improved performance seen over a 4-year
period on that test was not confirmed when a different test was also administered in the fourth year. Preliminary
data indicate that the "... results of this district's high-stakes test overstate achievement [in mathematics] by as
much as 8 academic months by the spring of grade 3."6 Policymakers (and the public) are interested in mathematics
achievement broadly defined, not just as defined by one particular test. These results suggest that "... information
provided to the public by accountability-oriented tests can be seriously misleading."7

The Lake Wobegon episode taught policymakers and the testing community a number of important lessons
about norms, test selection, teaching to the test, and the distorting effects of high-stakes testing. Perhaps the greatest
significance of the phenomenon was to demonstrate the validity of a warning that has been provided by educational
testing experts for many years: no single test should ever be the basis for important policy decisions about schools
or individuals.8



4 For views on the difference between ethical and unethical test preparation activities see William A. Mehrens and John Kaminski,
"Methods for Improving Standardized Test Scores: Fruitful, Fruitless, or Fraudulent?" Educational Measurement: Issues and Practice, vol. 8,
spring 1989, pp. 14-22; and Thomas M. Haladyna, Susan B. Nolen, and Nancy S. Haas, "Raising Standardized Achievement Test Scores and
the Origins of Test Score Pollution," Educational Researcher, vol. 20, No. 5, June-July 1991, pp. 2-7.

5 Koretz, op. cit., footnote 2, p. 14.

6 Daniel Koretz, Robert Linn, Stephen Dunbar, and Lorrie Shepard, "The Effects of High Stakes Testing On Achievement: Preliminary
Findings About Generalizations Across Tests," paper presented at the annual meeting of the American Educational Research Association,
Chicago, IL, April 1991.

8 See, e.g., Anne Anastasi, Psychological Testing (New York, NY: Macmillan Publishing Co., 1988).
Proponents of high-stakes testing, however, counter
these arguments with data from the National
Assessment of Educational Progress (NAEP). Unlike the
high-stakes tests, for which score increases can be
attributed to test-taking skills rather than genuine
achievement, NAEP trends are considered by most
experts to be a better gauge of trends in achievement.50
Thus, the fact that NAEP scores went up in the
1970s and 1980s has become a linchpin in the
pro-MCT argument.51

But, once again, closer inspection of the timing 
and significance of NAEP trends suggests a more 
complex picture, one that defies simple attribution to 
MCT or any other single policy. First, NAEP scores 
did rise in the 1970s and 1980s, but the rise actually 
began to be noticed as early as the 1974 assessment, 
well before MCT was in operation in all but one or 
two States. 

Second, the magnitude of the rise was consider- 
ably less impressive than the magnitude recorded on 
other standardized tests. Although some might argue 
that NAEP underestimates true achievement be- 
cause NAEP test takers perceive no particular 
incentive to do their best, even correcting for this 
possibility would not erase the large gap between 
increases on other tests and the increases on NAEP. 

Third, the most impressive aspect of longitudinal
analysis of NAEP scores is the narrowing of the
achievement gap between minority and white
students: "... the average achievement of Blacks and
of Hispanic students is substantially higher now than
a decade ago."52 This is hailed by some as the most
convincing proof of the value of MCT,53 while
others note that: 1) the narrowing of the gap is
explained largely by improvements at the low end of
the range of achievement, 2) the overall gap between
achievement of minority and white students remains
quite large, and 3) gains among minority students in
basic literacy and numeracy skills may have come at
the expense of gains in higher order skills, which,
according to NAEP data, have been stagnant at best.



Undue Emphasis on Basic Skills 

Prompted by these trends in NAEP, a number of
researchers have investigated the hypothesis that
basic skills improvements may have been made
possible by a shift of instructional resources away
from higher order academic skills. NAEP reports,
for example, have emphasized the lack of progress
in so-called higher order skills during the period of
progress in basic skills. But other studies have been
more optimistic. Researchers working with the Iowa
Tests of Basic Skills, for example, produced
evidence contradicting NAEP's: performance of
comparable samples of 9-, 13-, and 17-year-olds
increased between 1979 and 1985 on higher order
questions even more than on basic skills items,
continuing a trend observed from 1971 on.54

Contradictory evidence about test score trends
notwithstanding, there is widespread agreement that
State-mandated testing, and MCT in particular, had
damaging effects on classroom behavior of teachers
and students. One study combined analysis of survey
data and intensive interviews with teachers and
school administrators, and concluded that the testing
reinforced the already excessive emphasis on basic
skills and stymied local efforts to upgrade the
content of education being delivered to all students.
The authors of this study write:

Although [the] ability of a Statewide testing
program to control local activity may be
praiseworthy in the minds of some educational critics, the
activity the program stimulated was not reform.
Responding to testing did not encourage educators to
reconsider the purposes of schooling; their purpose
quickly became to raise scores and lower the
pressure directed toward them. Responding to
testing did not encourage educators to restructure their
districts; they redirected time, money, and effort so
that some parts of their systems could more
expeditiously address the test score crisis while leaving the
parts unaffected by testing or producing 'good'
scores unscathed. Responding to testing did not
encourage educators to rethink how they should
teach or how they should administer schools; once



50 For a fuller discussion of the origins and technical characteristics of the National Assessment of Educational Progress, see ch. 3.

51 See Lerner, op. cit., footnote 33.

52 Robert Linn and Stephen Dunbar, "The Nation's Report Card Goes ho
vol. 72, No. 2, October 1990, pp. 127-133.

53 See Lerner, op. cit., footnote 33.

54 See Elizabeth Witt, Myunghee Han, and H.D. Hoover, "Recent Trends in Achievement Test Scores: Which Students are Improving and on What
Levels of Skill Complexity?" paper presented at the annual meeting of the National Council on Measurement in Education, Boston, MA, 1990.
again they addressed process only in the parts of their
system that felt the direct impacts of testing.55

Narrowing Effect 

While there is agreement among many studies of 
MCT that local districts have changed curriculum, 
instructional methods, and textbooks to align them
more with the content of MCT instruments, there are 
differences of opinion about whether this is a good 
or bad trend. Some studies have bemoaned the 
narrowing effect that MCT seems to have had on 
instructional strategies, content coverage, and 
course offerings. The values embodied by MCT— 
that there is a fixed body of knowledge that students 
must absorb by a certain age, that mastery of this 
content is reflected in student responses to paper-and- 
pencil tests, and that student failure on the test is the 
school's responsibility to correct — tend to reinforce 
educational practices that are mechanical, superfi- 
cial, and fragmented, such as passive learning, drill 
and practice, and adherence to age-grade distinctions
and subject-matter boundaries.⁵⁶ Moreover,
alignment to a State standard does not reflect the
meaningful differences between localities. 

Effects on Achievement and on Teacher 
Behavior 

Recent research suggests that improvements on 
high-stakes tests do not generalize well to other 
measures of achievement in the same domain. For 
example, in one study mathematics performance on 
a conventional high-stakes test was found to not 
generalize to other tests for which students have not 
been specifically prepared. The authors of this study 
caution, therefore, that: ". . . information provided
to the public by accountability oriented tests can be
seriously misleading."⁵⁷ The evidence is somewhat
contradictory about the extent to which teachers 



modify their instructional practices in ways that are
likely to produce higher test scores. One-half of the
respondents to one nationally representative survey 
of eighth grade mathematics teachers (n=552) said 
they did not prepare students at all for mandated 
tests; of those who said they did, almost one-half 
reported spending no more than several periods a 
year on these efforts (and mathematics is one of the 
most tested areas).⁵⁸ It is also important to note,
however, that of the group who said that testing 
influenced their instruction, 30 percent said they 
increased basic skills emphasis; 24 percent said they 
added emphasis on topics covered on the test; and 19 
percent said they decreased their emphasis on 
project work, since it was not directly assessed by 
the test.⁵⁹

Research studies that focus in particular on 
teachers in districts with high-stakes testing conditions
(such as MCT, school evaluation tests, or
externally developed course-end tests) demonstrate
a greater influence of testing on curriculum and 
instruction. A study of four elementary classrooms 
with both mandated State and district objectives-based
testing found that students spent up to 18
hours annually taking tests and about 54 hours 
receiving instruction that appeared to be directly 
oriented toward the tests.⁶⁰ Teachers of New York
Regents courses, which have high-stakes testing at
the end of the course, report spending anywhere
from a few class periods to about 10 class periods
(out of 175) reviewing and preparing for the examinations.
Even the upper number reflects a rather
modest direct effect of testing.⁶¹

One recent study, which sought to disentangle the 
effects of high-stakes testing on teaching and 
learning, showed fairly convincing evidence of 



⁵⁵H.D. Corbett and B. Wilson, "Unintended and Unwelcome: The Local Impact of State Testing," paper presented at the annual meeting of the
American Educational Research Association, Boston, MA, April 1990, pp. 10-11.

⁵⁶Archbald and Porter, op. cit., footnote 31. Also see ibid.

⁵⁷Daniel Koretz, Robert Linn, Stephen Dunbar, and Lorrie Shepard, "The Effects of High Stakes Testing on Achievement: Preliminary Findings
About Generalizations Across Tests," paper presented at the annual meeting of the American Educational Research Association, Chicago, IL, April 1991,
p. 20.

⁵⁸Thomas Romberg, Anne Zarrinia, and Steven Williams, The Influence of Mandated Testing on Mathematics Instruction: Grade 8 Teachers'
Perceptions (Madison, WI: National Center for Research in Mathematical Science Education, University of Wisconsin-Madison, 1989), pp. 33-39.
Nevertheless, the authors concluded that changes in instruction brought about by the tests were incompatible with the kinds of changes sought by the
mathematics community. See discussion below.

⁵⁹See also Shepard, op. cit., footnote 6.

⁶⁰Claire Rottenberg and Mary Lee Smith, "Unintended Effects of External Testing in Elementary Schools," paper presented at the annual meeting
of the American Educational Research Association, Boston, MA, April 1990.

⁶¹Douglas Archbald, "Curriculum Control and Teacher Autonomy," paper presented at the annual meeting of the American Educational Research
Association, Boston, MA, April 1990.
testing influencing teacher practices. This study
found that: 

• teachers felt pressured to improve test scores; 
79 percent reported "great" or "substantial"
pressure by district administration and the 
media; 

• teachers reported giving greater emphasis to 
basic skills instruction than they would have in 
the absence of the mandatory tests; 

• one-half the teachers reported giving less emphasis
to subjects not on the tests;

• one-half the teachers reported spending 4 or 
more weeks per year giving students work- 
sheets and practice exercises to review content 
they expected to be on the test and to prepare 
students for the tests; 68 percent of the teachers
reported conducting these preparation activities
"regularly," i.e., throughout the school
year and not just in the days or weeks prior to 
testing; and 

• the majority of teachers could identify numerous
beneficial uses of the tests, such as ". . . setting
instructional goals, providing feedback
about student strengths and weaknesses, and
identifying gaps in instruction . . . [but] these
benefits . . . were offset or greatly outweighed
by negative effects such as the amount of
instructional time given to test preparation, the
amount of stress experienced, unfair or invalid
comparisons, and the demoralizing effects on
teachers and students."⁶²

These findings on the effects of high-stakes 
testing on teacher behavior, which the authors of the 
study described above caution are not necessarily 
generalizable, raise fundamental questions about the 
use of tests for instructional reform. 

Misuse of MCT Data for School Comparisons

Another lesson from the MCT experience is that 
if test data are available they will be used to make 
comparisons and judgments about districts, schools, 
and students regardless of the data's original pur- 
pose, the ways in which it was collected, or how 
many caveats are issued as warnings about potential 
misuse. These types of comparisons, furthermore, 
ignore differences between school districts with 
large variations in student populations, resources,



and other factors affecting instruction; not only are 
the comparisons damaging to the self-esteem of
students and schools, they are also potentially
misleading to policymakers seeking information on 
how to improve the schools. 

Conclusions 

Viewing the MCT glass as at least half-full, 
proponents have argued for more high-stakes testing 
and, in particular, for more high-stakes testing that 
covers advanced skills. Their argument is simply 
that if it worked for the basic skills it can work for 
the higher order skills.⁶³ These supporters of high-stakes
testing argue that MCT worked because it:

• defined a single performance standard tied to
powerful incentives (promotion or graduation); 

• allowed teachers latitude in choosing whatever 
instructional methods they thought would be 
most appropriate to bring their students closer 
to the defined standards of performance; 

• signaled to students the importance of acquir- 
ing basic skills in order to become productive 
citizens in a democracy; and 

• conveyed to all students that they could acquire 
the necessary skills. 

Critics contend that MCT is not a genuine tool of 
reform because it: 

• does not provide school systems with information
on how to improve instruction, but rather
serves to reinforce the instructional methods 
already in place; 

• ignores differences between school districts 
with large variations in student populations, 
resources, and other factors affecting instruc- 
tion; and 

• creates conditions under which true reform is 
not possible, by emphasizing test scores rather 
than improved learning. 

In the current debate over testing, it is common to 
hear both sides invoke the lessons of the minimum 
competency movement. Proponents focus on the 
powerful effects of high-stakes testing on clarifying 
and reinforcing curricula, and argue that once the 
right curricula are established tests will make them 
work. Critics fear that more high-stakes testing will 
reinforce outmoded curricula, provide misleading 
⁶²For a detailed discussion of methods, sample, and results, see Lorrie Shepard and Katherine Dougherty, "Effects of High Stakes Testing on
Instruction," paper presented at the annual meeting of the American Educational Research Association, Chicago, IL, April 1991.

⁶³Lerner, op. cit., footnote 33.
information to policymakers, and create artificial
obstacles to educational and economic opportunity. 

The positive and negative lessons of MCT, and of
100 years of prior experience with standardized
tests, should inform policy for the future of testing
in America. Although some of the evidence is 
contradictory, even confusing, one thing is clear: 
test-based accountability is fraught with uncertain- 
ties—it is no panacea. Specific proposals for tests 
intended to catalyze school improvement must be 
scrutinized on their individual merits, with certain 
cautions in mind. First, the evidence seems clear that 
as the stakes attached to test results heat up, so do 
teacher and student efforts to do better on the tests,
which can lead to instructional activities that do not 
necessarily promote real learning. Second, there is a 
compelling rationale to design high-stakes tests that: 
a) sharpen incentives for students and teachers to 
practice for them, but b) contain material worth 
practicing for. Experience to date suggests that 
designing such tests is harder than originally imag- 
ined and that none has yet been implemented 
successfully.⁶⁴ Third, it is dubious that mandated
testing alone has the potential to effect the sorts of 
restructuring needed to substantially reform educa- 
tion. 

Increased Concern About the 
Appropriate Use of Tests 

Testing policy in the United States has been
influenced by the tugs of two countervailing tides: 
pressure for more testing with higher stakes on one 
hand, and cries for a slower pace and more careful 
examination of consequences on the other. As the 
influence of educational tests expanded in the 1970s 
and 1980s, a counterbalancing trend emerged. Individuals
with different interests (parents, students,
scholars, lawyers, writers, civil libertarians) began
questioning the role of tests in their own and others' 
lives and sounding alarms about the effects of tests
on individual privacy, equal opportunity, and fair- 
ness in the allocation of future opportunities. This 



antitesting movement encompassed a variety of 
sentiments, from skepticism about the validity of
tests to apprehension about the damaging effects of 
their misuse. In addition, the trend gained momen- 
tum from the growth of consumerism and some key 
victories in Congress and the courts. The themes of 
this backlash against standardized testing, in the past 
and today, have tended to cluster around certain 
passion-inspiring issues: fairness, bias, due process,
individual privacy, and disclosure. 

In the late 1960s, for example, the idea of a 
"self-fulfilling prophecy" gained a foothold in the
American consciousness, supported in part by a 
controversial study of teacher expectations. In this 
study, teachers were told that a test had identified a 
subset of children as "bloomers" whose achievement
could be expected to flourish during the school
year.⁶⁵ Despite the fact that these bloomers were
actually chosen at random, many showed impressive 
gains, outpacing their "nonbloomer" classmates.
This study, which has since been found to contain 
many weaknesses, caught the public fancy and 
helped to support the arguments of many that
disadvantaged children were failing in school due to 
teachers' low expectations about their abilities. It 
also alerted the public to the potential dangers of 
labeling children on the basis of test scores, and thus 
limiting their educational futures.⁶⁶

As this example illustrates, it is not only the tests
themselves that create controversy. Testing practices
and policies (the ways tests are used and the
types of inferences drawn from them) also create
many of the problems associated with testing. There 
is widespread agreement among educators, analysts, 
measurement experts, and test publishers that tests 
are often used for functions for which they were not 
designed or validated, and that test results are often
misinterpreted. 

What Constitutes Fair Testing Practice? 

Attempts to develop ethical and technical stand- 
ards for tests and testing practices have a long 



⁶⁴The possibility that certain types of performance assessments might solve the dilemma has generated enthusiastic research and experimentation.
See ch. 6.

⁶⁵Robert Rosenthal and Lenore Jacobson, Pygmalion in the Classroom: Teacher Expectation and Pupils' Intellectual Development (New York, NY:
Holt, Rinehart and Winston, 1968).

⁶⁶For other sources on the self-fulfilling prophecy and rejoinders to the original study see Ray C. Rist, "Student Social Class and Teacher Expectations:
The Self-Fulfilling Prophecy in Ghetto Education," Harvard Educational Review, vol. 40, No. 3, August 1970, pp. 411-451; J.D. Elashoff and Richard
E. Snow (eds.), Pygmalion Reconsidered (Worthington, OH: Jones, 1971); and Samuel S. Wineburg, "The Self-Fulfillment of the Self-Fulfilling
Prophecy: A Critical Appraisal," and replies by Robert Rosenthal and Ray C. Rist, Educational Researcher, vol. 16, No. 9, December 1987, pp. 28-44.
history. These efforts have been made primarily by
professional groups involved in the design and 
administration of tests, such as psychologists and 
educational measurement specialists. Although discussions
of such standards began at the turn of the
century, the first organized efforts, at mid-century, 
resulted in the adoption of a formal code of ethics for 
psychologists in 1952 and a set of technical recom- 
mendations regarding test use developed by three 
professional groups in 1954.⁶⁷ This latter document,
known in its most recent version as the Standards for 
Educational and Psychological Testing (hereafter 
referred to as the Standards), has been revised three 
times in the intervening years.⁶⁸

Some of these technical standards pertain to tests 
themselves: the methods by which they should be
developed, the data required to support their use, and 
evidence of their fairness. Although aimed primarily
at the developers and publishers of tests, the 
standards have relevance for test users, who must 
evaluate the adequacy of the tests they buy or 
commission. 

Many of the technical standards contain guidelines
for test use: appropriate procedures for the
selection, administration, and interpretation of tests, 
and guidelines affecting the rights of test takers. The 
two incidents quoted below, for example, represent 
violations of principles of appropriate testing prac- 
tice. 

A high school newspaper carried a page one 
headline: "Meet the geniuses of the incoming class"
and listed all pupils of IQ 120 and up with numerical
scores. Then under a heading: "These are not
geniuses, but good enough" were listed all the rest, 
with IQ scores down to the 60's. 

A new battery of tests for reading readiness was 
introduced in a school. Instead of the customary two 



or three, 12 beginners were this year described by the 
test as not ready for reading. They were placed in a 
special group and given no reading instruction. The
principal insisted that if the parents or anyone else
tried to teach them to read "Their little minds would
crack under the strain." In at least two cases parents
did teach them to read with normal progress in the 
first semester, and later mental tests showed IQ's 
above 120.⁶⁹

As these examples suggest, one of the major 
problems with the professional Standards is that
most of the principal interpreters of educational test 
results (such as policymakers, school administra- 
tors, teachers, and journalists) are unaware of them 
and are untrained in appropriate test use and 
interpretation. 

A set of testing standards should consider the 
needs of three main participants in the testing 
process: 1) the test developer who constructs and
markets tests, 2) the test user (usually the institution 
that selects tests and uses them to make some 
decision), and 3) the test taker who takes the test 
". . . by choice, direction, or necessity."⁷⁰ Some
form of consumer protection or assurance is needed
for both the test user and the test taker, but
particularly for the latter: ". . . who is still the least
powerful of the three."⁷¹ As depicted in figure 2-4,
the test-taker's fate rests on the assumption that good 
testing practice has been upheld by both the test
developer when it constructed the test and the test 
user (such as the school) when it selected, interpreted,
and made a decision on the basis of the test.
With few exceptions, the test taker has no direct 
contact with or access to the test developer; the test 
user serves as the primary filter through which 
testing information reaches the test taker.⁷² Just as
the patient undergoing an electrocardiogram must 
assume that the machine is soundly built and 
correctly calibrated, that the technician is admini- 



⁶⁷The American Psychological Association, the American Educational Research Association, and the National Council on Measurement in Education;
and Walter Haney and George Madaus, "The Evolution of Ethical and Technical Standards for Testing," Handbook of Testing, R. Hambleton (ed.)
(Amsterdam, The Netherlands: North-Holland Publishing Co., in press).

⁶⁸In 1966, 1974, and 1985.

⁶⁹American Psychological Association, quoted in Haney and Madaus, op. cit., footnote 67.

⁷⁰Melvin Novick, "Federal Guidelines and Professional Standards," American Psychologist, vol. 36, No. 10, October 1981, p. 1035.

⁷¹James V. Mitchell, Jr., "Testing and the Oscar Buros Lament: From Knowledge to Implementation to Use," Social and Technical Issues in Testing:
Implications for Test Construction and Usage, Barbara S. Plake (ed.) (Hillsdale, NJ: L. Erlbaum Associates, 1984).

⁷²For college and graduate admissions tests such as the SAT, ACT, and GRE, test takers do have direct contact with test developers. On these tests,
students register directly with the test developers and receive explanations of the test, scoring methods, test-taking strategies, as well as score reports
from them. Records of test scores, in these cases, remain in the hands of test developers, so privacy protection must also be assured by the developer.
In contrast, the responsibility for and control of the test-takers' scores remains with the school system for most educational achievement tests administered
during elementary and secondary years.
Figure 2-4. Appropriate Testing Practice in Education: Four Major Obligations of Test
Developers and Test Users to Test Takers*

Test developers

(a) Provide information needed for users to select appropriate tests
(b) Help users interpret scores correctly
(c) Strive to make tests that are as fair as possible

Test users

(a) Select tests to meet intended purposes and appropriate for population being tested
(b) Interpret scores correctly
(c) Select tests that have been developed in ways that make them as fair as possible

Test takers

(d) Inform test takers about test coverage, scores, and their interpretation, privacy, and other
rights

*This chart is based on The Code of Fair Testing Practices in Education, which outlines four areas of major obligation
to test takers: 1) developing/selecting tests, 2) interpreting scores, 3) striving for fairness, and 4) informing test takers.
See the Code for the specific principles in each area.

NOTE: For some kinds of tests, such as college admissions tests, test developers have direct contact with test takers;
in these cases, they are also obligated to the set of principles (d) regarding appropriately informing test takers.

SOURCE: Joint Committee on Testing Practices, Code of Fair Testing Practices in Education (Washington, DC:
National Council on Measurement in Education, 1988).



stering the test properly, and that the physician is 
interpreting the information appropriately, so must 
the test taker assume that the choice of test, its 
method of administration, and its interpretation are 
correct. Currently, few mechanisms exist to assure 
such protection for educational tests. 

The assurance of good testing practice for the test
taker is further complicated by the absence of 
information about tests. Testing manuals, which
document development and validation processes, 
are highly technical, and considerable training is 
required to evaluate the statistical properties of 
much of this test data. In addition, most tests are 
closely supervised by developers and users, in order 
to maintain the secrecy of test items, which is 
important to assuring that the test remains fair for all 
current and future test takers.⁷³ The compulsory
nature of most schoolwide testing programs presents 



yet another complication: students and their parents 
can exercise little choice about whether a child 
should be tested. In sum, a social and ethical tension 
exists between the need for close professional 
supervision of tests and the need for open public 
discussion and knowledge about tests by test takers — 
especially those whose educational opportunities 
may be affected by their use.

Since the 1977 version of the Standards, more 
attention has been given to the rights of the persons 
being tested. This attention to consumers' rights,
however, appears to conflict somewhat with the 
need for test security. For example:

Concerning testing, the 1977 Standards states that 
"Persons examined have the right to know results,
the interpretations made, and where appropriate the
original data on which final judgments were
made." In light of the very next sentence, the
⁷³In fact the ethical principles of psychologists prohibit them from releasing tests to unqualified persons; dissemination of any standardized test risks
invalidating the test and giving some test takers an unfair advantage over others.
modifier "where appropriate" looms large and
uncertain: "Test users avoid imparting unnecessary
information which would compromise test security. . . ."
An obvious question remains: When do the rights of 
test takers leave off and the need for test security 
begin?⁷⁴

Agreement about what constitutes good testing 
practice is far from unanimous even among profes- 
sionals; as the above example suggests, considerable 
latitude of interpretation is allowed for any one of 
the standards. For the most part each standard is a 
general principle, a goal to strive for and uphold; the 
specific criteria by which it is met are not explicitly 
stated. The principles governing the appropriate 
administration of standardized achievement tests in 
schools are a good example. What one school district 
may call legitimate test preparation activities (prac- 
tice, coaching, and explanation of instructions to 
students), another may deem dubious or even 
unethical. These different interpretations are one of 
the principal causes of test score "inflation."⁷⁵

Recently some professional groups have been 
working to translate the more technical Standards 
into principles for untrained users of tests, such as 
administrators, policymakers, and teachers. The 
Code of Fair Testing Practices in Education⁷⁶ (for
basic provisions, see figure 2-4) attempts to outline 
the major obligations that professionals who use or 
develop educational tests have to individual test 
takers. These principles are widely agreed on and 
endorsed by professional groups as central to the fair 
and effective use of tests.⁷⁷

What agreement is there about the rights of test 
takers? Is there a consistent set of ethical principles 
that should be followed? Most professional groups 
seem to agree that test takers should be provided 
with certain basic information about: 



• content covered by the test and type of question 
formats; 

• the kind of preparation the test taker should 
have and appropriate test-taking strategies to 
use (e.g., should they guess or not?); 

• the uses to which test data will be put; 

• the persons who will have access to test scores 
and the circumstances under which test scores 
will be released to anyone beyond those who
have such access;

• the length of time test scores will be kept on 
record; 

• available options for retesting, rescoring or 
cancelling scores; and 

• the procedures test takers and their parents or 
guardians may use to register complaints and 
have problems resolved.⁷⁸

An important question arises regarding the principle
of "informed consent," defined by the Standards as:

The granting of consent by the test taker to be 
tested on the basis of full information concerning the 
purpose of the testing, the persons who may receive 
the test scores, the use to which the test score may be 
put, and such other information as may be material 
to the consent process.⁷⁹

Since most children cannot give truly informed 
consent, an adult serving as a proxy must give 
consent. Although in most cases such a proxy will be 
the parent, there appear to be certain circumstances
under which school officials are allowed to grant 
permission for collecting and using pupil information.
Currently, the Standards suggest that test data
collected on a schoolwide basis or by a legislated 
requirement are exempt from parental informed 



⁷⁴Haney and Madaus, op. cit., footnote 67, p. 6.

⁷⁵See, e.g., Thomas M. Haladyna, Susan Bobbit Nolen, and Nancy S. Haas, "Raising Standardized Achievement Test Scores and the Origins of Test
Score Pollution," Educational Researcher, vol. 20, No. 5, June-July 1991, pp. 2-7.

⁷⁶Authored by the Joint Committee on Testing Practices initiated by the American Educational Research Association, the American Psychological
Association, and the National Council on Measurement in Education in 1988. Joint Committee on Testing Practices, Code of Fair Testing Practices in
Education (Washington, DC: National Council on Measurement in Education, 1988).

⁷⁷Similar efforts are under way in other countries. For example, a number of professional groups in Canada, drawing on the experience of the Joint
Committee who developed the Code, have begun working on a set of principles for Canadian testing programs.

⁷⁸See, e.g., American Educational Research Association, American Psychological Association, and National Council on Measurement in Education,
Standards for Educational and Psychological Testing (Washington, DC: 1985); Joint Committee on Testing Practices, op. cit., footnote 76; Russell Sage
Foundation, Guidelines for the Collection, Maintenance, and Dissemination of Pupil Records (New York, NY: 1969); U.S. Department of Education,
Office of Educational Research and Improvement, Your Child and Testing (Washington, DC: 1980).
⁷⁹American Educational Research Association et al., op. cit., footnote 78, pp. 91-92.
consent; consent is given in this case by school
officials.⁸⁰

Informed consent also implies that the test takers 
are aware that they are being tested. As high-stakes
tests are now conducted, children are certainly well 
aware that they are being tested: instructions, 
setting, and testing booklets all serve to clearly mark 
the testing session as something different from the
everyday business of the classroom. Parents and 
children are usually notified in advance when tests 
will be given, in part so that parents can assure that 
their children are well rested and fed on testing day. 
Conditions and circumstances of testing are made 
clear so that all children have the chance to do their
best. 

How can parents be assured that tests are being 
used appropriately by schools to make decisions, 
particularly about individual students? One of the 
persistent problems with tests is that they are used 
for purposes not originally intended. Those being 
tested are not always directly informed about the 
uses and purposes of testing. Although it has long 
been considered to be the ethical responsibility of 
test administrators and developers to assure that tests
are used only for purposes intended, there are few, if 
any, safeguards to assure this. Furthermore there are 
even fewer protections for the test score information 
once it is obtained: scores that sit in a child's record
can be used by anyone who has access to that record 
whether or not that person knows anything about the
particular test that was administered. It is difficult to 
prevent the misuse of test-based information once
that information has been collected. 

How is Fair Testing Practice Encouraged 
or Enforced? 

It follows from this analysis that the first step
toward fair testing practice is agreement on a set of 
principles or guidelines about appropriate and inap- 
propriate test practices. Achieving such a consensus 
is not always a simple or clear-cut process. But given 
that some agreement already exists about what 



constitutes appropriate and inappropriate test use,
how can these practices be encouraged or enforced 
and unfair practices be discouraged? 

Right now there are four mechanisms for encouraging
fair and appropriate testing practices: professional
self-regulation, education, litigation, and
legislation.

Professional Self-Regulation 

Professional self-regulation is the primary mechanism
for promoting good testing practices in education.
tion. Standards and codes for testing developed by 
professional associations, critical reviews of tests by 
experts, and individual professional codes of ethics
all contribute to better testing practices among 
testing professionals; nevertheless, many profes- 
sionals agree that these codes lack sufficiently 
strong enforcement mechanisms.⁸¹ The Buros Institute
of Mental Measurements has long been concerned
with the education of test users and the
assurance of quality tests. As part of these efforts the
Institute publishes the Mental Measurements Yearbook
(MMY), first published in 1938, which contains
critical reviews by experts of nearly all
commercially available psychological and educa- 
tional tests. Recently, Institute personnel concluded 
that 41 percent of the tests reviewed in The Eighth 
Mental Measurements Yearbook were lacking in 
reliability and/or validity data.82 In the years before
his death, Oscar Buros often lamented the lack of 
effect that either the Standards or the Buros Institute 
had on test quality or use. In a speech in 1968, for 
example, Buros reported the following: 

At present, no matter how poor a test may be, if 
it is nicely packaged and if it promises to do all sorts 
of things which no test can do, the test will find many 
gullible buyers. When we initiated critical test
reviewing in The 1938 Yearbook, we had no idea 
how difficult it would be to discourage the use of 
poorly constructed tests of unknown validity. Even
the better informed test users who finally become 
convinced that a widely used test had no validity 
after all are likely to rush to use a new instrument



80 The Standards read: ". . . informed consent should be obtained from test takers or their legal representatives before testing is done except (a) when
testing without consent is mandated by law or governmental regulation (e.g., statewide testing programs); (b) when testing is conducted as a regular part
of school activities (e.g., schoolwide testing programs and participation by schools in norming and research studies); or (c) when consent is clearly
implied (e.g., application for employment or educational admissions)." Ibid., p. 85.

81 See, e.g., George Madaus, "Public Policy and the Testing Profession--You've Never Had It So Good?" and reactions by former National Council
on Measurement in Education presidents William E. Coffman, Thomas J. Fitzgibbon, Jason Millman, and Lorrie A. Shepard, in Educational
Measurement: Issues and Practice, winter 1985, pp. 5-16.

82 Mitchell, op. cit., footnote 71.
which promises far more than any good test can
possibly deliver.83

In addition, the efforts by professionals to self- 
regulate are often aimed at developing technically 
sound tests and thus at the transactions between test 
developers and test users. Less attention has been 
directed toward the even more intractable problem
of how to assure that tests are used appropriately 
once developed and chosen by a school. How can 
good testing policies be assured once a testing 
program, over which test takers have no choice 
about participation, is put in place? 

Education and Public Discussion 

Education and public discussion about tests, their 
limitations (as well as their value), and the principles 
of appropriate test use is the second way better 
testing practices could be encouraged. If the general 
public, parents, and test takers understood what 
questions to ask about tests and what protections to 
expect, then those who administer and choose tests 
would be more accountable for their testing prac- 
tices. A number of testing experts believe that more 
open examination of test use and its social conse- 
quences could help encourage better practices on the 
part of those responsible for administering and
interpreting tests.84

Teachers, principals, school boards, superinten-
dents, and others who set testing policies for schools
are another audience for educational efforts. Some 
proposals have recommended mandatory training 
for teachers to help them better understand tests and 
good testing practices.85 Recently several profes-
sional associations jointly drew up a set of "Stand-
ards for Teacher Competence in Educational Assess-
ment of Students," which established guidelines for
what teachers should know in order to use various
assessment techniques appropriately.86 Others have
called for better training of administrators and have
encouraged rewarding of administrators for good
assessment practices in their schools.87

Litigation 

Litigation is the third route toward better testing 
practice. "Before the 1960's, the courts were rarely
concerned with testing or evaluation of students.
Most likely, their concern was limited because,
under the standard of 'reasonableness,' standardized
testing was a subject left principally to the profes-
sional discretion of school teachers and administra-
tors."88 And since the courts showed little interest in
test-related issues, as characterized in this quotation, 
lawyers had no incentive to bring legal actions about 
testing practices. 

As the use of tests increased, so did their potential 
for causing legally significant harm to test takers.89
The court's "hands off" approach changed in the
1970s and 1980s, with the filing of several lawsuits 
challenging the uses of standardized tests in educa- 
tion. The activism of parents, civil rights advocates, 
and civil liberties groups was an important spur to 
the development of case law in this area. Overall, 
however, educational tests have received far fewer 
legal challenges than have employment-related 
tests.90

Most litigation involving standardized educa- 
tional tests involves individuals who, alone or as a 
class, claim violations of fundamental rights. These 
include the constitutional rights of due process and 
equal protection, and the rights guaranteed by 
Federal laws, such as civil rights, equal opportunity, 
and education of individuals with disabilities. The 
issues tend to center on the use of tests for 
classification, exclusion, and tracking, or the privacy 
of individual test takers. In these cases, the defen- 
dants are usually State and local school administra- 



83 Oscar K. Buros, "The Story Behind the Mental Measurements Yearbooks," Measurement and Evaluation in Guidance, vol. 1, 1968, p. W.

84 Mitchell, op. cit., footnote 71; and Walter Haney, "Testing Reasoning and Reasoning About Testing," Review of Educational Research, vol. 54,
No. 4, winter 1984, pp. 597-654.

85 John R. Hills, "Apathy Concerning Grading and T...
Literacy," Phi Delta Kappan, vol. 72, No. 7, March 1991, pp. 534-539; and Robert Lynn Canady and Phyllis Riley Hotchkiss, "It's a Good Score! Just
a Bad Grade," Phi Delta Kappan, vol. 71, No. 1, September 1989, pp. 68-73.

86 American Federation of Teachers, National Council on Measurement in Education, and National Education Association, "Standards for Teacher
Competence in Educational Assessment of Students," unpublished document, 1990.

87 Hills, op. cit., footnote 85.

88 James E. Bruno and John C. Hogan, "What Public Interest Lawyers and Educational Policymakers Need to Know About Testing: A Review of
Recent Cases, Laws and Areas of Future Litigation," Whittier Law Review, vol. 7, No. 4, 1985, p. 917.

89 Donald N. Bersoff, "Social and Legal Influences on Test Development and Usage," in Plake (ed.), op. cit., footnote 71.
90 See Wigdor and Garner (eds.), op. cit., footnote 29, for an overview of legal issues in employment and educational testing.
tors. Some of the earliest challenges to testing
practices focused on racial discrimination. Under 
attack were certain classification and tracking poli-
cies, not uncommon in Southern schools resisting
desegregation, that used I.Q. and other tests in
ways that resulted in resegregation. Federal courts
quickly barred these types of programs.91

Often it is the testing policy or the way a test is 
being used, rather than the test itself, that is 
challenged in court. In addition, most legal chal-
lenges have dealt with tests used for the so-called
"gatekeeping" functions: college admissions, mini-
mum competency, or special education placement.
Thus, tests are most likely to receive legal scrutiny 
and challenge when they are used to make signifi- 
cant decisions about individual students. In general, 
the courts have most often sought guidance from and 
upheld the Standards. 

Some of the most significant cases involving due 
process and testing were spawned by the minimum 
competency movement. The first such case, the 
landmark Debra P. v. Turlington, claimed that the
Florida law requiring students to pass a functional 
literacy test before obtaining a high school diploma 
violated the student plaintiffs' rights to due process
and equal protection, as well as the Equal Educa- 
tional Opportunities Act. After examining such 
issues as whether the test assessed skills that were 
actually taught, whether there was adequate notice 
of the requirement, whether students had access to 
adequate remediation, and whether they had oppor-
tunities to take the test over, the court enjoined
Florida from implementing the law until 1982-83,
after the vestiges of the State's formerly segregated
school system were presumed to have dissipated. 

As in other cases, the court referred to the
Standards in reaching its decision. However, this 
case also demonstrated quite clearly the consider- 
able latitude for interpretation and professional 
judgment required to translate the Standards into 
specific recommendations for practice. During the 
trial, two testing experts, both of whom were 
members of the committee that drew up the
Standards in 1974, offered divergent and conflicting 
expert views about the kind of validity evidence the 
State of Florida should have provided.92

Resorting to the courts to settle issues of good testing
practice is often a last recourse. Most legal challenges to
educational tests have occurred when these tests have
been used for selection, certification, or placement
of students.

The body of case law reveals some broad themes 
about how courts view tests, and some general 
principles about acceptable and unacceptable uses of 
tests. In general, courts have a great respect for 
well-constructed, standardized tests that are clearly 
tied to the curriculum. They do not find them 
arbitrary or irrelevant to the legitimate State interest 
in improving education. A minimum competency 
test, for example, is a reasonable method of assess- 
ing students' basic skills. In addition, Federal courts
have hesitated to interfere in the education process 
or second guess local school district personnel. 

Courts tend to look at how the results of the tests 
are used. If there are allegations that tests were used
to deny graduation diplomas, place students in lower 
education tracks, or misclassify students as mentally 
disabled — any situations in which a test taker can 
claim serious injury — then the cases will be given 
more careful scrutiny. Cases involving historically 
vulnerable groups of students, such as minorities and
children with disabilities, also raise flags. 



91 Chachkin, "Testing in Elementary ...
Allocation: The Workplace and the Law, Bernard R. Gifford (ed.) (Boston, MA: Kluwer, 1989).
92 See Haney, op. cit., footnote 84.
Usually Federal lawsuits involving the use of tests
have been successful only where there was a claim 
that the test violated some other, independently 
established Federal right, such as the right of due
process or protection from racial discrimination.93
State courts have shown similar deference to local
judgment.

Court decisions have established some other basic 
guidelines about tests and their applications. Tests
should accurately reflect their intended content.
Students should have opportunities to learn the
material on the tests in school. Students should
receive adequate notice to prepare for the tests. The
examinations should not be used as the sole factor in
determining placement or status. The scoring proce-
dures should accurately assess mastery of the
content.94

Courts have protected the privacy of the parent- 
child relationship when testing of a very personal 
nature, such as certain psychological and diagnostic
tests, has interfered with family relationships or the 
parents' rights to rear their children. On the flip side, 
courts have also tended to protect the security of 
tests by reaffirming the applicability of copyright 
laws to test materials. 

Resorting to the courts to settle issues of good 
testing practice is often a last recourse. However, 
many testing experts as well as educators feel that 
courts are not the optimal arena in which to set
policies regarding tests and their use. "If educators
have a difficult time matching students with appro-
priate educational placements, judges have no expe-
rience at all."95

One clear alternative to courts as watchdogs is to
encourage school systems and policymakers to be 
more careful about the testing policies they imple- 
ment. Many school testing policies are not set 
clearly and explicitly nor are they publicly available. 
As one litigator, involved for many years in testing 
and tracking litigation in schools, has written: 

. . . the most difficult part of such litigation is the
process of factual investigation to determine exactly
what use is being made of what tests in a particular
district.96

A recent case in New York State suggests that 
educational administrators may have an important
role to play in providing guidance and supervision 
regarding the fairness of school testing policies. The 
mother of an eighth grade student who had been
excluded from enrichment programs because of her
test scores on the Iowa Tests of Basic Skills (ITBS)
appealed that decision. The district superintendent 
denied her appeal, supporting the school board's 
policy of using this test as the screening criterion for
the enrichment program. This mother then appealed
her case to the New York State Commissioner of 
Education who, after reviewing the evidence about 
the ITBS, issued an order prohibiting the district's 
use of test scores as the sole determinant for 
eligibility for educational enrichment programs. In 
part the order reads: 

Given the proviso in the ITBS testing manual,
respondents' use of its test scores as a screening
device that automatically excludes a student from
further consideration for placement in an enrichment
program is inconsistent with the specific guidelines
provided by the developers of the ITBS test.
Furthermore, because the results of a single test may
be adversely affected by factors such as anxiety,
illness, test-taking ability, ability to process direc-
tions or general distractibility (which have little to do
with ability or achievement), use of standardized test 
scores as a screening device may serve to exclude 
pupils prematurely who are otherwise eligible. 
Based on the foregoing, I conclude that respon-
dents' (the district) policy which denies a student the
possibility of further consideration for placement in
an enrichment program solely on the student's
failure to achieve above a certain score on a subpart
of the ITBS is not a legitimate measure for screening
a student's capacity for success in an enriched
program and is, therefore, arbitrary, capricious and
contrary to sound educational policy.97

As the attorney cited above notes: 

As we (litigators) accumulate more knowledge 
about both test construction and test misuse in



93 Chachkin, op. cit., footnote 91.
94 Bruno and Hogan, op. cit., footnote 88.

95 William H. Clune, "Courts as Cautious Watchdogs: Constitutional and Policy Issues of Standardized Testing in Education," report prepared for
the National Commission on Testing and Public Policy, 1988, p. 1.
96 Chachkin, op. cit., footnote 91, p. 186, emphasis added.

97 Order #12433 of the State Education Department of New York, issued Dec. 7, 1990 by Thomas Sobol, Commissioner of Education, p. 3, emphasis
educational settings, it will become easier for
attorneys to gather these facts and litigation will 
continue and expand. For this reason, policymakers,
legislators, and educational administrators are well 
advised to conduct their own reviews for the purpose 
of restricting test use to appropriate functions within
their institutions and systems.98

Federal Legislation 

Federal legislation is the fourth avenue to im- 
proved test practice. Some of the practices common- 
place today in educational testing are the result of 
legislative efforts. In the mid-1970s, Congress 
passed a series of laws with significant provisions
regarding testing and assessment, one affecting all 
students and parents and the others affecting individ- 
uals with disabilities and their parents. In both cases, 
this Federal legislation has had far-reaching implica- 
tions for school policy because Federal financial 
assistance to schools has been tied to compliance 
with these legislated mandates regarding appropri- 
ate testing practices. 

The Family Education Rights and Privacy Act 
of 1974 (FERPA)— FERPA, commonly called the 
"Buckley Amendment" after former New York
Senator James Buckley, was enacted in part to 
attempt to safeguard parents' rights and to correct 
some of the improprieties in the collection and 
maintenance of pupil records. This legislation drew
heavily on a set of voluntary guidelines regarding 
pupil records, called the Russell Sage Foundation 
Conference Guidelines, drawn up in 1969 by a panel 
of education professors, school administrators, sociol- 
ogists, psychologists, professors of law, and a 
juvenile court judge.99 The basic provisions of this
legislation are twofold. First it establishes the right 
of parents to inspect school records. Second, it 
protects the confidentiality of information by limit-
ing access to school records (including test scores)
to those who have legitimate educational needs for
the information and by requiring written parental
consent for the release of identifiable data (see table
2-3). 



Table 2-3: Federally Legislated Rights Regarding Testing and School Records

I. The Family Education Rights and Privacy Act of 1974

   A. Right to inspect records:
      1. Right to see all of a child's test results that are part of the
         child's official school record.
      2. Right to have test results explained.
      3. Written requests to see test results must be honored in 45 days.
      4. If child is over 18, only the child has the right to the record.

   B. Right to privacy: Rights here limit access to the official school
      records (including test scores) to those who have legitimate
      educational needs.

II. The Education of All Handicapped Children Act of 1975
    and the Handicapped Rehabilitation Act of 1973

   A. Right to parent involvement:
      1. The first time a child is considered for special education
         placement, the parents must be given written notice in
         their native language, and their permission must be
         obtained to test the child.
      2. Right to challenge the accuracy of test scores used to plan
         the child's program.
      3. Right to file a written request to have the child tested by
         other than the school staff.
      4. Right to request a hearing if not satisfied with the school's
         decision as to what are the best services for the child.

   B. Right to fairness in testing:
      1. Right of the child to be tested in the language spoken at home.
      2. Tests given for placement cannot discriminate on the
         basis of race, sex, or socioeconomic status. The tests
         cannot be culturally biased.
      3. Right of child to be tested with a test that meets special
         needs (e.g., Braille or orally).
      4. No single test score can be used to make special
         education placement decisions. Right to be tested in
         several different ways.

SOURCE: E.B. Herndon, Your Child and Testing (Washington, DC: U.S.
Department of Education, National Institute of Education,
October 1980), pp. 26-27.



FERPA was an early victory for the proponents of 
public disclosure of test results and to date their only 
significant success in the Federal arena. During the 
1980s, several "truth in testing" bills were intro-
duced in Congress, intended to make tests more
accessible to individuals who took them. Amid press 
reports about serious scoring mistakes and the 
publication of books accusing major testing compa- 
nies of greed and arrogance, these bills gained 
momentum for a while, but none were enacted. The 



98 Chachkin, op. cit., footnote 91, p. 186.

99 With respect to "informed consent," the Russell Sage Foundation Conference Guidelines, op. cit., footnote 78, state that: ". . . no information should
be collected from students without the prior informed consent of the child and his parents," p. 16. However, these guidelines also specify the types of
data for which the notion of representational consent can be accepted. Representational consent means that permission to collect data is given by
appropriately elected officials, such as the State Legislature or local school board. The Guidelines go on to clarify that: "no statement of consent, whether
individual or representational, should be binding unless it is freely given after the parents (and students where appropriate . . .) have been fully informed,
preferably in writing, as to the methods by which the information will be collected; the uses to which it would be put; the methods by which it will be
recorded and maintained; the time period for which it will be retained; and the persons to whom it will be available, and under what conditions."
p. 17.
drive for Federal action to ensure better testing
practices has since stalled. 

These bills were patterned, to some extent, on 
legislation passed by New York and California 
requiring testing companies to disclose to State
commissions information about tests and testing 
procedures, as well as the answers to test questions. 
In general these laws have contained three main 
provisions: 1) that test developers file information 
about the reliability and validity of the test with a 
government agency, 2) that they inform students
what their scores mean, how scores will be used and
how access to the scores will be controlled, and 3)
that individual test takers have access to corrected 
questions (after the test), not just the score they 
receive. It is largely this third provision that has 
made this type of legislation so controversial; the 
first two provisions (assuring access to information 
about the test's development and assuring that the 
test taker is appropriately informed and privacy
protected) are basic tenets of good testing prac-
tice.100 The premise behind these laws is that by
increasing public scrutiny of tests, their develop- 
ment and their uses, potential harm to individuals 
can be headed off in the early stages — as when a 
testing company makes a scoring error — and the 
tests themselves will become more accurate and fair.

Legislation Affecting Individuals With Dis- 
abilities—The Rehabilitation Act of 1973 bars 
recipients of Federal funds from discriminating
against individuals with disabilities. In the educa- 
tional arena, the act has been interpreted to protect 
against misclassification of people as retarded, 
learning disabled, or mentally disabled in other 
ways. 

One of the most consistent recommendations of
testing experts is that a test score should never be 
used as the single criterion on which to base 
decisions about individuals. Significant legal chal- 
lenges to the overreliance on I.Q. test scores in 
special education placements led to an exemplary 
Federal policy on test use in special education
decisions. The Education for All Handicapped
Children Act of 1975 (Public Law 94-142) was
designed to assure the rights of individuals with 
disabilities to the best possible education. Congress 
included eight provisions designed to protect stu- 
dents and ensure fair, equitable, and nondiscrimina- 
tory use of tests in implementing this program. 
Among the provisions were: 1) decisions about 
students are to be based on more than performance 
on a single test, 2) tests must be validated for the 
purpose for which they are used, 3) children must be 
assessed in all areas related to a specific or suspected 
disability, and 4) evaluations should be made by a
multidisciplinary team.101 This legislation provides,
then, a number of significant safeguards against the 
simplistic or capricious use of test scores in making 
educational decisions. 

Conclusion: Toward Fair Testing Practice 

Legal challenges have affected testing practices in
some important ways. First, they have ". . . made the
[psychological and testing] profession, as well as
society in general, more sensitive to racial and
cultural differences and to how apparently innocent
and benign practices may perpetuate discrimination.
[Second, they have] . . . alerted psychologists to the
fact that they will be held responsible for their
conduct."102 Third, by drawing some attention to the
rights of test takers and responsibilities of test 
administrators, they have accelerated the search for 
better means of assessing human competencies in all 
spheres. 

Even after the enactment of FERPA and 25 years 
of court challenges, the current level of protection 
against test misuse remains rather low when com- 
pared with some other areas of consumer interest. 
Protections consist primarily of warnings in test 
publishers' manuals and a handful of State laws. 
Few public school districts, except for the very 
largest, have staffs with adequate backgrounds in 
psychometrics, fully trained in professional ethics 
and responsibilities governing test use and misuse. 
For most school systems, there is an abundance of 
public and government pressure to test students 
extensively, but a minimum of support to help them 



100 "The truth in testing legislation has focused . . . probably in part because such tests seem to have
more visible consequences for the fate of individual test-takers than did testing of students below the college age, but surely also because college age
test-takers had considerably more political clout than test-takers too young to vote." Mehrens and Lehmann, op. cit., footnote 22, p. 629; and Haney,
op. cit., footnote 84.

101 John Salvia and James E. Ysseldyke, Assessment in Special and Remedial Education, 3rd ed. (Boston, MA: Houghton Mifflin Co., 1985).
102 Donald N. Bersoff, "Testing and the Law," American Psychologist, vol. 36, No. 10, October 1981, p. 1055.
make ". . . proper, cautious interpretations of the
data which are produced."104

As educational test use expands, examination of 
the social consequences of test use on children and 
schools must also be a priority. More social dialog 
and openness about what constitutes acceptable and 
unacceptable testing practices should be encour-
aged. Furthermore, tests used for the gatekeeping
functions of selection, placement, and certification
should be very carefully examined and their social 
consequences considered. If high-stakes testing 
spreads into new realms, such as a national test, we 
can expect to see the number of court challenges and 
the demand for legislative and regulatory safeguards 
multiply. Options for Congress to consider to foster 
better testing practice are discussed in chapter 1.



104 Chachkin, op. cit., footnote 91.
Highlights

• As the Federal financial commitment to education expanded during the 1960s and 1970s, new demands
for test-based accountability emerged. Federal policymakers now rely on standardized tests to assess
the effectiveness of several Federal programs. 

• Evaluation requirements in the Federal Chapter 1 program for disadvantaged children, which result in
more than 1.5 million children being tested every year, have helped escalate the amount of testing in
American schools. Questions arise about whether results of Chapter 1 testing produce an accurate
picture of the program's effectiveness, about the burden that the testing creates for schools, teachers,
and children, and about the usefulness of the information provided by the test results.

• The National Assessment of Educational Progress (NAEP) is a unique Federal effort begun in the 
1960s to provide long-term and continuous data on the achievement of American school children in
many different subjects. NAEP has become a well-respected instrument to help gauge the Nation's
educational health. Recent proposals to change NAEP to allow for comparisons in performance
between States, to establish proficiency standards, or to use NAEP items as a basis for a system of
national examinations raise questions about how much NAEP can be changed without compromising
its original purposes. 

• National testing is a critical issue before Congress today. Many questions remain about the objectives, 
content, format, cost, and administration of any national test. 



The role of the Federal Government in educa- 
tional testing policy has been limited but influential. 
Given the decentralized structure of American
schooling, few decisions supported with test infor- 
mation are made at the Federal level. States and local 
school districts make most of the decisions about 
which tests to give, when to give them, and how to 
use the information. The Federal Government 
weighs in primarily by requiring test-based meas-
ures of effectiveness for some of the education
programs it funds, operating its own testing program 
through the National Assessment of Educational 
Progress (NAEP), and affording some limited pro-
tections and rights to test takers and their parents
(see ch. 2). 

This circumscribed Federal role has nevertheless 
influenced the quantity and character of testing in 
American schools. As Federal funding has expanded
over the past 25 years, so has the Federal appetite for
test-based evidence aimed at ensuring accountabil-
ity for those funds. This growth in Federal influence
has evolved with no specific and deliberate Federal
policy on testing. Most Federal decisions about
testing have been made in the context of larger
program reauthorization bills, with evaluation ques-
tions treated as program issues rather than testing
policy issues. As discussed in the preceding chapter,
Congress did consider several bills in the 1970s and
1980s related to test disclosure and the rights of test
takers; only the Family Education Rights and
Privacy Act of 1974 became law.

This picture is changing. Congress now faces 
several critical choices that could redefine the 
Federal role in educational testing. In three policy
areas, Congress has already played an important
role, and its decisions in the near term could have 
significant consequences for the quantity and quality 
of educational testing. Accountability for federally 
funded programs is the first area. The tradition of
achievement testing as a way to hold State- or
district-level education authorities accountable is as
old as public schooling itself. Continued spending
on compensatory education has become increas- 
ingly dependent on evidence that these programs are 
working. Thus, for several decades now the single 
largest Federal education program—Chapter 1 (Com- 
pensatory Education) — has struggled with the need 
for evaluation data from States and districts that 
receive Federal monies. Increasing reliance on 
standardized norm-referenced achievement tests to
monitor Chapter 1 programs indicates an increasing
Federal influence on the nature and quantity of 
testing. Congress has revised its accountability 
requirements on several occasions, and in today's
atmosphere of test reform, the $6 billion Federal 
Chapter 1 program can hardly be ignored. The basic 
policy question is whether the Federal Government 
is well served by the information derived from the 
tests used today and whether modifications could 
provide improved information.

Second, Federal support for collection of educa- 
tional data, traditionally intended to keep the Nation 
informed about overall educational progress, is now 
viewed by some as a lever to influence teaching and 
learning. Thus, the 20-year-old NAEP, widely ac- 
claimed as an invaluable instrument to gauge the 
Nation's educational health, has, in the past few 
years, attracted the attention of some policymakers 
interested in using its tests to change the structure 
and content of schooling. 

A third and related issue is national testing. In 
addition to various suggested changes to NAEP, a 
number of proposals have emerged recently — from 
the White House, various agencies of the executive 
branch, and blue ribbon commissions — to imple- 
ment nationwide tests. Although the purposes of 
these tests vary, it is clear they are intended to bring 
about improved achievement, not simply to estimate 
current levels of learning. The idea of national
testing seems to have gained greater public
acceptability. Proponents argue that "national" does not
equal "Federal," and that national education
standards do not require Federal determination of
curricula and design of tests. Others fear that national
testing will lead inevitably to Federal control of 
education. 

OTA analyzed the development and effects of the 
current Federal role in testing and examined pending 
proposals to change that role. This chapter discusses 
OTA's findings vis-à-vis Chapter 1, NAEP, and
national testing. 



Chapter 1, Elementary and Secondary
Education Act: A Lever on Testing

The passage of the 1965 Federal Elementary and 
Secondary Education Act (ESEA) heralded a new 
era of broad-scale Federal involvement in education 
and established the principle that with Federal 
education funding comes Federal strings. The
cornerstone of ESEA was Title I (renamed Chapter 1 in
1981), which is still the largest program of Federal
aid to elementary and secondary schools.^1 The
purpose of Title I/Chapter 1, both then and now, is
to provide supplementary educational services,
primarily in reading and mathematics, to low-achieving
children living in poor neighborhoods. With an
appropriation of $6.2 billion for fiscal year 1991,^2
Chapter 1 channels funds to almost every school
district in the country. Some 51,000 schools,
including over 75 percent of the Nation's elementary
schools, receive Chapter 1 dollars, which are used to
fund services to about 5 million children in
preschool through grade 12. Given its 25-year history
and broad reach, the effect of Chapter 1 on Federal
testing policy is profound.

History of Chapter 1 Evaluation 

From the beginning, the Title I/Chapter 1 law 
required participating school districts to periodically 
evaluate the effectiveness of the program in meeting 
the special educational needs of educationally
disadvantaged children, using ". . . appropriate objective
measures of educational achievement"^3 — interpreted
to mean norm-referenced standardized tests.
Congress has revised the evaluation requirements many
times to reflect changing Federal priorities and
address new State and local concerns.

During the 1960s and 1970s, the Title I evaluation 
provisions generally became more prescriptive and 
detailed. In 1981, a dramatic loosening of Federal 
requirements occurred: while evaluations were still 
required, Federal standards governing the format,
frequency, and content of evaluations were deleted.
In the absence of Federal guidance, confusion about
just what was required ensued at the State and local



^1 The remainder of this section is from Nancy Kober, "The Role and Impact of Chapter 1 Evaluation and Assessment Requirements," OTA contractor
report, May 1991.

^2 Of this $6.2 billion, approximately $5.5 billion is distributed by formula to local school districts. The remainder is used for three State-administered
programs for migrant students, students with disabilities, neglected and delinquent children, and for other specialized programs and activities, such as
State administration and technical assistance.

^3 Public Law 89-10.

President Johnson signing the Elementary and Secondary Education Act of 1965 at a school in Johnson City, Texas.
The enactment of this law was a milestone in Federal education policy.



levels. Congress responded by gradually retightening
the evaluation requirements. The most recent set
of amendments, the 1988 reauthorization, made
Chapter 1 assessment more consequential and
controversial than ever before by requiring Chapter 1
schools to modify their programs if they could not
demonstrate achievement gains among participating
children — the so-called "program improvement
provisions."

Through all these revisions, the purposes of Title 
I/Chapter 1 evaluation have remained much the
same: to determine the effectiveness of the program 
in improving the education of disadvantaged chil- 
dren; to instill local accountability for Federal funds; 
and to provide information that State and local 
decisionmakers can use to assess and alter programs. 

Specific Requirements for
Evaluating Programs 

Title I/Chapter 1 is a partnership between Federal, 
State, and local governments, and the evaluation 



provisions reflect this division of responsibility. 
Evaluation of the effects of Chapter 1 on student 
achievement begins at the project level — usually the
school. Test scores of participating children are
collected from schools, analyzed, and summarized 
by the local education agency (LEA). Each LEA 
reports its findings to the State education agency 
(SEA), which aggregates the results in a report to the 
U.S. Department of Education. (States can, if they 
wish, institute additional requirements regarding the 
format, content, and frequency of Chapter 1 evalua- 
tions.) Congress, by statute, and the Department of 
Education, through regulations and other written 
guidance (particularly the guidance in the
Department's Chapter 1 Policy Manual^4), set standards for
SEAs and LEAs to follow in evaluating and
measuring progress of Chapter 1 students. The
Department also compiles the State data and sends
Congress a report summarizing the national
achievement results, along with demographic data for
Chapter 1 participants.

^4 U.S. Department of Education, Chapter 1 Policy Manual (Washington, DC: April 1990).

Standardized Tests and
Mandated Evaluations 

Since the creation of the Title I/Chapter 1
Evaluation and Reporting System (TIERS) in the
mid-1970s, the Department has relied on
norm-referenced standardized test scores as an available,
straightforward, and economical way of depicting
Chapter 1 effectiveness. The law, for its part, gives
an imprimatur to standardized tests, through
numerous references to "testing," "scores," "objective
measures," "measuring instruments," and
"aggregate performance." Chapter 1 evaluation has
become nearly synonymous with norm-referenced
standardized testing.

The purpose of TIERS has changed little since it
became operative in 1979: to establish standards that 
will result in nationally aggregated data showing 
changes in Chapter 1 students* achievement in 
reading, mathematics, and language arts. To
conform with TIERS, States and local districts must
report gains and losses in student achievement in 
terms of Normal Curve Equivalents (NCEs), a 
statistic developed specifically for Title I. NCEs 
resemble percentile scores, but can be used to 
compute group statistics, combine data from differ- 
ent norm-referenced tests (NRTs), and evaluate 
gains over time. (Gains in scores, which can range 
from 1 to 99, with a mean of 50, reflect an 
improvement in position relative to other students.^5)
To produce NCE scores, local districts must use an
NRT or another test whose scores can be equated 
with national norms and aggregated. Thus, although 
the Chapter 1 statute does not explicitly state that 
LEAs must use NRTs to measure Chapter 1 effec- 
tiveness, the law and regulations together have the 
effect of requiring NRTs because of their insistence
on aggregatable data and their reliance on the NCE 
standard. 
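
The relationship between percentile ranks and NCEs described above can be illustrated with a short sketch. It assumes the conventional definition of the NCE scale, in which a percentile rank is converted to a standard normal deviate z and rescaled as NCE = 50 + 21.06z, so that NCEs of 1, 50, and 99 coincide with the same percentile ranks. The function names and the sample scores are hypothetical and are not part of TIERS.

```python
from statistics import NormalDist, mean

# Scale factor of the NCE scale: chosen so that percentile ranks
# 1, 50, and 99 correspond to NCEs of 1, 50, and 99.
NCE_SD = 21.06


def percentile_to_nce(percentile_rank: float) -> float:
    """Convert a national percentile rank (0 < rank < 100) to an NCE."""
    if not 0 < percentile_rank < 100:
        raise ValueError("percentile rank must lie strictly between 0 and 100")
    z = NormalDist().inv_cdf(percentile_rank / 100)
    return 50 + NCE_SD * z


def nce_to_percentile(nce: float) -> float:
    """Convert an NCE back to a percentile rank."""
    return 100 * NormalDist().cdf((nce - 50) / NCE_SD)


def mean_nce_gain(pretest_ranks, posttest_ranks) -> float:
    """Average NCE gain for matched pretest/post-test percentile ranks."""
    return mean(
        percentile_to_nce(post) - percentile_to_nce(pre)
        for pre, post in zip(pretest_ranks, posttest_ranks)
    )


# Hypothetical percentile ranks for three students, before and after services.
example_gain = mean_nce_gain([20, 25, 30], [24, 27, 33])

# A 5-point percentile change is a larger NCE change near the tails
# than near the middle of the distribution.
low_end_change = percentile_to_nce(10) - percentile_to_nce(5)
middle_change = percentile_to_nce(50) - percentile_to_nce(45)
```

Because NCEs form an equal-interval scale, they can be averaged across students and tests in a way that raw percentile ranks cannot, which is the property the text attributes to them.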

The 1988 law, as interpreted by the Department of 
Education, changed the basic evaluation provisions
in ways that increased the frequency and
significance of standardized testing in Chapter 1.
Specifically, the law:



• through the new "program improvement"
provisions, put teeth into the longstanding Title
I/Chapter 1 requirement that LEAs use evaluation
results to determine whether and how local
programs should be modified. Schools with
stagnant or declining aggregate Chapter 1 test
scores must develop improvement plans, first 
in conjunction with the district and then with 
the State, until test scores go up. 

• gave the Department the authority to reinstate 
national guidelines for Chapter 1 evaluation 
(which had been eliminated in 1981) and 
required SEAs and LEAs to conform to these 
standards. 

• focused greater attention on (and, through
regulation, required measurement of) student 
achievement in higher order analytical, reason- 
ing, and problem-solving skills. 

• directed LEAs to develop "desired outcomes,"
or measurable goals, for their local Chapter 1 
programs, which could include achievement 
outcomes to be assessed with standardized 
tests. 

• expanded the option for high-poverty schools 
to operate schoolwide projects,^6 as long as they
can demonstrate achievement gains (i.e., higher
test scores) among Chapter 1-eligible children. 

• as interpreted by the Department, required 
LEAs to conduct a formal evaluation that met 
TIERS standards every year, rather than every 
3 years. (In actual practice, most States required 
annual evaluations.)

Other Uses of Tests in Chapter 1 

Producing data for national evaluations is only 
one of several uses of standardized tests in Chapter
1. Under the current law and regulations, LEAs are 
required, encouraged, or permitted to use tests for all 
the following decisions: 

• identifying which children are eligible for 
Chapter 1 services and establishing a "cutoff
score" to determine which children will actually
be served;

• assessing the broad educational needs of Chap- 
ter 1 children in the school; 



^5 Mary Kennedy, Beatrice F. Birman, and Randy E. Demaline, The Effectiveness of Chapter 1 Services (Washington, DC: U.S. Department of
Education, 1986), p. E-2.

^6 Under the schoolwide project option, schools with 75 percent or more poor children may use their Chapter 1 funds for programs to upgrade the
educational program for all children, without regard to Chapter 1 eligibility; in exchange for this greater flexibility, these schools must agree to increased
accountability.

• determining the base level of achievement of
individual Chapter 1 children before receiving 
services (the "pretest");

• assessing the level of achievement of Chapter 
1 children after receiving services (the
"post-test"), in order to calculate the change data
required for national evaluations; 

• deciding whether schools with high proportions 
of low-achieving children should be selected 
for projects over schools with high poverty; 

• allocating funds to individual schools; 

• establishing goals for schoolwide projects; 

• determining whether schoolwide projects can
be continued beyond their initial 3-year project 
period; 

• annually reviewing the effectiveness of Chapter
1 programs at the school level for purposes
of program improvement; 

• deciding which schools must modify their 
programs under the "program improvement"
requirements; 

• determining when a school no longer needs 
program improvement; 

• identifying which individual students have 
been in the program for more than 2 years 
without making sufficient progress; and 

• assessing the individual program needs of 
students that have participated for more than 2 
years. 

In addition, Congress and the Department of 
Education use standardized test data accumulated 
from State and local evaluations for a variety of 
purposes: 

• justifying continued appropriations and author- 
izations; 

• weighing major policy changes in the program; 

• targeting States and districts for Federal moni- 
toring and audits; and 

• contributing to congressionally mandated stud- 
ies of the program. 

Competing Tensions 

Chapter 1 is a good example of how Congress 
must weigh competing tensions when making deci- 
sions about Federal accountability and testing. For 
example, in Chapter 1, as in other education 
programs, the need for Federal accountability must 



be weighed against the need for State and local 
flexibility in program decisions. The Federal
appetite for statistics must be viewed in light of the
undesirable consequences of too much Federal 
burden and paperwork — lost instructional time and 
declining political support for Federal programs, to
name a few. The Federal desire for succinct,
"objective," and aggregatable data must be judged 
against the reality that test scores alone cannot
provide a full and accurate picture of Chapter 1's
other goals and accomplishments (e.g., redistribut- 
ing resources to poor areas, mitigating the social 
effects of child poverty, building children's self 
esteem, and keeping students in school). Finally, the 
Federal need for summary evaluations on which to 
formulate national funding and policy decisions
must be weighed against the local need for meaning- 
ful, child-centered information on which to base 
day-to-day decisions about instructional methods 
and student selection. 

The number of times Congress has amended the 
Chapter 1 evaluation requirements suggests how 
difficult it is to balance these competing tensions. 

Effects of Chapter 1 on Local Testing 

Chapter 1 has helped create an enormous system 
of local testing. Almost every Chapter 1 child is 
tested every year, and in some cases twice a year, to 
meet national evaluation requirements. In school 
year 1987-88, over 1.6 million Chapter 1
participants were tested in reading and just under 1 million
in mathematics. Sometimes this testing is combined 
with testing that fulfills State and local needs; other 
times Chapter 1 has caused districts to administer 
tests more frequently, or with different instruments, 
than they would in the absence of a Federal 
requirement. 

Because SEAs and LEAs often use the same test
instruments to fulfill both their own needs and
Chapter 1 requirements, and because States and
districts expanded their testing programs during
roughly the period when Chapter 1 appropriations
were growing, it is difficult, perhaps impossible, to 
sort out which entity is responsible for what degree 
of the total testing burden. Although States and 
districts often coordinate their Chapter 1 testing with 
other testing needs, many LEAs report that without 



^7 A proposal to amend Title I so that all funding would be distributed on the basis of achievement test scores was put forth in the late 1970s by
then-Congressman Albert Quie (R-MN). The proposal was not accepted, but a compromise provision was adopted, which remains in the law today,
permitting school districts to allocate funds to schools based on test scores in certain limited situations.

Classrooms like this in Jefferson Parish, Louisiana, benefit from the extra assistance for disadvantaged students provided by
Chapter 1 of the Elementary and Secondary Education Act. Testing has always been a big part of Chapter 1 activity.



Chapter 1, they would do less testing. A district 
administrator from Detroit, for example, estimated 
that her school system conducts twice as much 
testing because of Chapter 1.^8 The research and
evaluation staff of the Portland (Oregon) Public
Schools noted that in the absence of a Chapter 1
requirement to test second graders, their district
would begin standardized testing later, perhaps in
the third or fourth grade.^9 (In school year 1987-88,
about 22 percent of Chapter 1 public and private
school participants were in grades pre-K through one
and were already exempted from testing. Another 26
percent of the national Chapter 1 population were in 
grades two and three; these children must be tested 
under current requirements.) One State Chapter 1 
coordinator said that without Chapter 1, his State 
would require only its State criterion-referenced 
instrument, and not NRTs. At the school level, 



principals and teachers express frustration with the 
amount of time spent on testing and tracking test
data in Chapter 1 and the degree of disruption it
causes in the academic schedule. 

National studies of Chapter 1 and case studies of 
its impact in particular districts have uncovered
some significant concerns about the appropriateness
of using standardized tests to assess the program's
overall effectiveness, make program improvement
decisions, and determine the success of schoolwide
projects. Over the years, Chapter 1 researchers and
practitioners have raised a number of technical
questions about the quality of Chapter 1 evaluation
data and have expressed caveats about its limitations
in assessing the full impact and long-term conse- 
quences of Chapter 1 participation. With the new 
requirements that raised the stakes of evaluation, 
debate over the data's validity and limitations has 



^8 Sharon Johnson-Lewis, director, Office of Planning, Research and Evaluation, Detroit Public Schools, remarks at OTA Advisory Panel meeting,
June 28, 1991.

^9 This and the other observations about the impact of Chapter 1 on testing practices are taken from Kober, op. cit., footnote 1. Case studies of the
Philadelphia, PA, and Portland, OR, public schools helped inform OTA's analysis and are cited throughout this chapter.

become more heated. For example, there is evidence
from Philadelphia, Portland, and other districts that 
because of measurement phenomena, test results do 
not always target for program improvement the 
schools with the lowest achievement or the weakest 
programs. Similarly, schools with schoolwide
projects have argued that a 3-year snapshot based on test
scores does not always provide adequate time or an 
accurate picture of the project's success compared 
with more traditional Chapter 1 programs. 

State and local administrators have also expressed 
concerns about the effect of Chapter 1 testing on 
instruction. While administrators and teachers are
loath to admit to any practices that could be
interpreted as ''teaching to the test," there is some 
evidence from case studies and national evaluations 
that teachers make a point to emphasize skills that 
are likely to be tested. In districts such as
Philadelphia and Portland, where a citywide test tied to local
curriculum is also the instrument for Chapter 1 
evaluation, teachers can readily justify this practice. 
Discomfort arises, however, when local administra- 
tors and teachers feel they are being pressed by 
Federal requirements to spend too much time 
drilling students in the type of "lower order" skills
frequently included on commercially published
NRTs, or when teachers hesitate to try newer
instructional approaches, such as cooperative
learning and active learning, for fear their efforts will not
translate into measurable gains. 

Of more general concern is the broad feeling that 
for the amount of burden it entails, Chapter 1 test
data is not very useful for informing local program 
decisions. According to case studies and other 
analyses, teachers and administrators use federally 
mandated evaluation results far less often than other
more immediate and more student-centered
evaluation methods — e.g., criterion-referenced tests (CRTs),
book tests, teacher observations, and various forms 
of assessment — to determine students' real progress 
and make decisions about instructional practices. 
Frequently the mandated evaluations are viewed as 
a compliance exercise — a "hoop" that States and
local districts must jump through to obtain Federal 
funding. 

Although Chapter 1 teachers, regular classroom 
teachers, and administrators do occasionally employ 



other types of assessment to make decisions about 
Chapter 1 students and projects, these alternative 
forms are not entrenched in the program in the same 
way that NRTs are, and are seldom considered part 
of the formal Chapter 1 evaluation process. While 
the Chapter 1 law contains some nods in the 
direction of alternative assessment — particularly for 
measuring progress toward desired outcomes and 
evaluating the effects of participation on children in
preschool, kindergarten, and first grade — the gen- 
eral requirements for evaluation cause local practi- 
tioners to feel that NCE scores are the only results 
that really matter. They believe that alternative 
assessment will not become a meaningful compo- 
nent of Chapter 1 evaluation without explicit en- 
couragement from Congress and the Department. 

One bottom-line question remains: what does the
large volume of testing data generated by Chapter 1 
evaluation tell Congress and other data users about 
the achievement of Chapter 1 children? To answer
this question, it is useful to consider the data from a 
10-year summary of Chapter 1 information, as 
shown in table 3-1.^10 The first thing that is apparent
from the summary data is how the millions of 
individual test scores required for Chapter 1 evalua- 
tion are aggregated into a single number for each 
grade for each year. Average annual percentile gains 
in achievement — comparing average student pretest 
scores and average post-test scores — have hovered 
in the range of 2 to 6 percentiles in reading, and 2 to 
11 percentiles in mathematics. For some grade 
levels, in some years, there have been greater 
improvements, but in general the gains have been 
modest and the post-test scores have remained low. 
For example, in 1987-88 the average post-test score 
for Chapter 1 fourth graders was the 27th percentile 
in reading and the 33rd percentile in mathematics. In 
analyzing these data it is important to understand 
that Chapter 1 children, by definition, are the lowest
achieving students in their schools, and that once a
child's test scores exceed the cutoff score for the
district that child is no longer eligible for Chapter 1
services. There has been some upward trend, more 
pronounced in mathematics than in reading, but 
overall closing of the gap has been slow. In addition, 
because there is no control group for Chapter 1 
evaluation, it is difficult to assess what these 
post-test scores really mean, i.e., how well Chapter 



^10 For the complete tables of data referred to in this discussion, see U.S. Department of Education, A Summary of State Chapter 1 Participation and
Achievement Information for 1987-88 (Washington, DC: 1990).

Table 3-1 — Achievement Percentiles for Chapter 1 Students Tested on an
Annual Cycle, 1979-80 to 1987-88

Changes in percentile ranks for reading

Grade 1979-80 1980-81 1981-82 1982-83 1983-84 1984-85 1985-86 1986-87 1987-88
2 2 2 2 2 2 3 2 4 4
3 4 5 3 4 4 4 4 5 5
4 3 4 4 4 4 5 5 6 5
5 3 5 5 5 5 6 5 4 4
6 4 6 5 6 5 5 5 5 5
7 - 3 4 3 4 6 4 3 4
8 1 4 5 4 4 4 4 3 4
9 2 4 4 4 3 2 3 2 3
10 -1 2 1 2 1 2 2 2 2
11 3 1 -1 0 2 3 3 2
12 2 0 2 0 1 0 0 e 0

Changes in percentile ranks for mathematics

Grade 1979-80 1980-81 1981-82 1982-83 1983-84 1984-85 1985-86 1986-87 1987-88
2 2 5 5 3 6 6 9 10 11
3 1 3 5 5 6 4 6 7 7
4 3 6 5 4 5 6 6 8 8
5 4 4 6 8 7 7 9 7 7
6 6 8 6 8 7 6 7 7 7
7 4 3 5 7 5 6 6 5 4
8 4 5 5 6 5 5 4 4 5
9 1 1 2 3 1 2 2 5 4
10 -2 1 0 2 1 2 4 3 4
11 1 2 1 1 2 3 4 3 3
12 2 0 1 0 3 2 2 4 -1

SOURCE: Beth Sinclair and Babette Gutmann, A Summary of State Chapter 1 Participation and Achievement
Information for 1987-88 (Washington, DC: U.S. Department of Education, 1990), pp. 49-60.



1 children would achieve in the absence of any 
intervention.^11
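
The difficulty of reading pretest/post-test gains without a control group can be illustrated with a small simulation of regression to the mean, the statistical phenomenon mentioned in the accompanying footnote. This is only a sketch under invented assumptions: the score scale, the cutoff, the amount of measurement error, and all names are hypothetical, and no program effect of any kind is modeled.

```python
import random
from statistics import mean


def simulate_apparent_gain(
    n_students: int = 20_000,
    cutoff: float = 35.0,      # hypothetical selection cutoff on the pretest
    true_sd: float = 17.0,     # hypothetical spread of true achievement
    noise_sd: float = 12.0,    # hypothetical measurement error per testing
    seed: int = 0,
) -> float:
    """Return mean post-test minus mean pretest for students selected
    because their pretest fell below the cutoff. No treatment is applied."""
    rng = random.Random(seed)
    pretests, posttests = [], []
    for _ in range(n_students):
        true_score = rng.gauss(50.0, true_sd)
        pretest = true_score + rng.gauss(0.0, noise_sd)
        if pretest < cutoff:
            # Same underlying achievement, fresh measurement error.
            posttest = true_score + rng.gauss(0.0, noise_sd)
            pretests.append(pretest)
            posttests.append(posttest)
    return mean(posttests) - mean(pretests)


apparent_gain = simulate_apparent_gain()
```

In this sketch the selected group shows a positive average "gain" even though nothing was done for the students, because children chosen for unusually low scores tend to score closer to their true level on retesting. Without a comparison group, a gain of this kind cannot be separated from a genuine program effect.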

For purposes of this analysis, the real question is 
whether the information from these test scores is 
necessary or sufficient to answer the accountability 
questions of interest to Congress. For the disadvan- 
taged population targeted by the program, the 
achievement score gains are evidence of improve- 
ment. Thus, when taken together with other evalua- 
tive evidence about the program's impact, the test 
scores support continued funding. But whether the 
test scores reveal anything significant about what 
and how Chapter 1 children are learning remains 
ambiguous. And in the light of unanticipated effects 
of the extensive testing, it is not clear that the
information gleaned from the tests warrants the
continuation of an enormous and quite costly
evaluation system in its present form.



Ripple Effects of Chapter 1 Requirements 

Title I/Chapter 1 established a precedent for
achievement-based accountability requirements adopted in
many subsequent Federal education programs. In the 
migrant education program added in 1966, the 
bilingual education program added in 1967, the 
Head Start program enacted in the Economic Oppor- 
tunity Amendments of 1967, and programs that 
followed, Congress required recipients of Federal
funds to evaluate the effectiveness of the programs
funded. As a result of Federal requirements, State
and local agencies administer a whole range of 
tests — to place students, assess the level of partici- 
pants' needs, and determine progress. Even when 
NRTs are not explicitly required, they are often the
preferred mode of measurement for Federal account- 
ability because they can be applied consistently, are 
relatively inexpensive, and leave a clearly understood

^11 One of the more vexing evaluation problems has been to infer "treatment effects" from studies with no control group. For discussion and analysis
of methods designed to correct for "regression to the mean" and other statistical constraints, see Anand Desai ...
Improvement Due to Compensatory Education Programs," Socio-Economic Planning Sciences, vol. 24, No. 2, 1990, pp. 143-153.

^12 For discussion of outcome-based performance measures in vocational education and job training programs see, e.g., U.S. Congress, Office of
Technology Assessment, "Performance Standards for Secondary School Vocational Education," background paper of the Science, Education, and
Transportation Program, April 1989.

and justifiable trail for Federal monitors and
auditors. 

The 1965 ESEA had another, less widely recog- 
nized impact on State testing practices. Title V of the 
original legislation provided Federal money to 
strengthen State departments of education, so that 
they could assume all the administrative functions 
bestowed on them by the new Federal education 
programs. This program helped usher in an era of 
increased State involvement in education and would 
have a significant impact down the road as States 
assumed functions and responsibilities far beyond 
those required by Federal programs or envisioned by 
Congress in 1965. 

Chapter 1 Testing in Transition 

OTA finds that because of its size, breadth, and 
influence on State and local testing practices,
Chapter 1 of ESEA provides a powerful lever by 
which the Federal Government can affect testing 
policies, innovative test development, and test use 
throughout the Nation. 

OTA's analysis brings to light several reasons
why Congress ought to reexamine and consider 
significant changes to the Federal requirements for 
Chapter 1 evaluation and assessment. 

• National policymakers and State and local 
program administrators have different data 
needs, not all of which are well served by 
NRTs. 

• The implementation of the 1988 program 
improvement and schoolwide project require- 
ments has underscored some of the inadequa- 
cies and limitations of using NRTs for local 
program decisions, while simultaneously in- 
creasing the consequences attached to these 
tests. 

• While the uses and importance of evaluation 
data have changed substantially as a result of 
the 1988 amendments, the methods and
instruments for collecting this data have remained
essentially the same since the late 1970s. A
better match is needed between the new goals
of the law, particularly the goal to improve the
quality of local projects, and the tools used to
measure progress toward those goals.

As Congress approaches Chapter 1 reauthori- 
zation, it should examine how all the pieces that 
affect testing under the umbrella of Chapter 1 fit



Photo credit: The Jenks Studio of Photography

Research has shown that early intervention is important,
and many schools like this one in Danville, Vermont,
use Chapter 1 funds for preschool and kindergarten
programs.

together. Many pieces are interrelated, but they do 
not always work harmoniously. For example, the
timing and evaluation cycles for Federal, State, and 
local testing in existing law are not well coordinated. 
As part of this review, Congress should pay particu- 
lar attention to the need to revise language that 
inadvertently endorses norm-referenced testing in 
situations where that type of testing may be
inappropriate. Options such as data sampling may meet
congressional needs. Clearer legislative language 
could help maintain and improve accountability, 
because States and local districts would know better 
what was expected. 
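
To see why data sampling might be able to serve national reporting needs, it can help to compare an estimate drawn from a sample of students with the average over every student. The sketch below is an illustration only: the gain scores are synthetic, the names are hypothetical, and it assumes simple random sampling, whereas a real survey design would more likely be stratified or clustered by State and district.

```python
import random
from statistics import mean, stdev


def sample_estimate(population, sample_size, seed=0):
    """Estimate the population mean from a simple random sample.

    Returns the sample mean and its approximate standard error
    (ignoring the finite-population correction).
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    estimate = mean(sample)
    standard_error = stdev(sample) / sample_size ** 0.5
    return estimate, standard_error


# Synthetic gain scores standing in for a full census of tested students.
_rng = random.Random(1)
population = [_rng.gauss(3.0, 10.0) for _ in range(200_000)]

census_mean = mean(population)
estimate, standard_error = sample_estimate(population, sample_size=2_000)
```

Under these assumptions a sample of 1 percent of the students yields an estimate whose standard error is a fraction of one score point, which suggests the kind of tradeoff between testing burden and precision that a sampling option would involve.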

The following questions can guide congressional 
deliberations regarding changes in Chapter 1: 

• What information does Congress need to make 
policy and funding decisions about Chapter 1? 
Is Congress getting that information, and is it 
timely and useful? 

• What information does the Department of 
Education need to administer the program? 

• How do the data needs of State and local 
agencies differ from those of the Federal 
Government and each other? 

• Is it realistic to serve national, State, and local
needs with the same information system based 
on the same measurement tool? 

• How well do NRTs measure what Chapter 1
children know and can do? 

• Is the nationally aggregated evaluation data that
is currently generated accomplishing what
Congress intended? Specifically, do aggregates
of aggregates of averages of NCE gains and
negative gains present a meaningful and valid
national picture of how well Chapter 1 children
are achieving? 

• To what extent is the value of cumulative data
symbolic rather than substantive? For example, 
is being able to point to a rising line on a chart 
as important as having accurate, meaningful 
data about what Chapter 1 children know and
can do? Can symbolic or oversight needs be 
fulfilled with less burdensome types of testing? 

• What other types of data, beyond test scores, 
might meet Federal policymakers' criteria for
objectivity? 
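
The question about "aggregates of aggregates of averages" can be made concrete with a small arithmetic sketch. The district figures below are invented for illustration, the function names are hypothetical, and nothing here describes how the actual reporting system weights its results; the point is only that an unweighted average of group averages can differ sharply from the average over all students.

```python
def mean_of_means(groups):
    """Unweighted average of group averages.

    groups: list of (mean_gain, n_students) pairs.
    """
    return sum(mean_gain for mean_gain, _ in groups) / len(groups)


def pooled_mean(groups):
    """Average over all students, weighting each group by its size."""
    total_students = sum(n for _, n in groups)
    return sum(mean_gain * n for mean_gain, n in groups) / total_students


# Invented example: two small districts with large average gains and
# one large district with a small average gain.
districts = [(8.0, 50), (6.0, 100), (1.0, 5000)]

unweighted = mean_of_means(districts)   # treats every district alike
weighted = pooled_mean(districts)       # treats every student alike
```

In this invented case the unweighted figure is several times larger than the student-weighted one, so the same underlying scores could support quite different national summaries depending on how the layers of aggregation are combined.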

In summary, OTA finds that Congress should 
revisit the Chapter 1 assessment and evaluation 
requirements in the attempt to lessen reliance on 
NRTs, reduce the testing burden, and stimulate the 
development of new methods of assessment more 
suited to the students and the program goals of 
Chapter 1. A careful reworking of the requirements 
could have widespread salutary effects on the use of 
educational tests nationwide. Congressional options 
for achieving these ends are identified in chapter 1 
of this report. 

National Assessment of Educational 
Progress 

By the late 1960s, Title I/Chapter 1 and other 
Federal programs had produced a substantial amount 
of data concerning the achievement of disadvan- 
taged children and other special groups of students. 
State and local testing told SEAs and LEAs how 
their students stacked up against national norms on 
specific test instruments. What was missing,
however, was a context: a nationally representative
database about the educational achievement of
elementary and secondary school children as a
group, against which to confirm or challenge
inferences drawn from State, local, or other
nationwide testing programs.



Although policymakers and the public could draw 
from a wide variety of statistics to make informed 
decisions on such issues as health and labor, they 
were operating in a vacuum when it came to
education. The Department of Education produced a
range of quantitative statistics on school facilities,
teachers, students, and resources, but had never
collected sound and adequate data on what American
students knew and could do in key subject areas.

Francis Keppel, U.S. Commissioner of Education 
from 1962 to 1965, became troubled by this dearth 
of information and initiated a series of conferences 
to explore the issue. In 1964, as a result of these 
discussions, the Carnegie Corp. of New York, a 
private foundation, appointed an exploratory com- 
mittee and charged it with examining the feasibility 
of conducting a national assessment of educational 
attainments. By 1966, the committee had concluded
that a new battery of tests, carefully constructed
according to the highest psychometric standards and
with the consensus of those who would use it,
would have to be developed.14

The vision became a reality in 1969, when the
U.S. Office of Education began to conduct periodic
national surveys of the educational attainments of
young Americans. The resulting effort, NAEP,
sometimes called "the Nation's report card," has
the primary goal of obtaining reliable data on the
status of student achievement and on changes in
achievement in order to help educators, legislators,
and others improve education in the United States.

Purpose 

Today, NAEP remains the only regularly conducted
national survey of educational achievement
at the elementary, middle, and high school levels.15
To date it has assessed the achievement of some 1.7
million young Americans. Although not every
subject is tested during every administration of the
program, the core subjects of reading, writing,
mathematics, science, civics, geography, and U.S.



13 In 1963, Keppel is reported to have lamented the fact that: "Congress is continually asking me about how bad or how good the schools are and we have no dependable information. They give different tests at schools for different purposes, but we have no idea generally about the subjects that educators value. . . ." OTA interview with Ralph W. Tyler, Apr. 5, 1991.

14 This early history of the National Assessment of Educational Progress (NAEP) is taken from the National Assessment of Educational Progress, General Information Yearbook (Washington, DC: National Center for Education Statistics, 1974); and George Madaus and Dan Stufflebeam (eds.), Educational Evaluation: Classic Works of Ralph W. Tyler (Boston, MA: Kluwer Academic Publishers, 1989). Conversations with Frank Womer, Edward Roeber, and Ralph Tyler, all involved in different capacities in the original design and implementation of NAEP, enriched the material found in published sources.

15 National Assessment of Educational Progress, The Writing Report Card (Princeton, NJ: Educational Testing Service, 1986).
Photo credit: Office of Technology Assessment, 1992

Known as the Nation's Report Card, the National Assessment of Educational Progress issues summary reports for
assessments conducted in a number of academic subject areas. These reports also analyze trends in achievement
levels over the past 20 years.



history have been assessed more than once to
determine trends over time. Occasional assessments
have also examined student achievement in
citizenship, literature, art, music, computer competence,
and career development.

Safeguards and Strengths 

The designers of the NAEP project took extreme
care and built in many safeguards to ensure that a
national assessment would not, in the worst fears of
its critics, become any of the following: a stepping
stone to a national individual testing program, a tool
for Federal control of curriculum, a weapon to
"blast" the schools, a deterrent to curricular change,
or a vehicle for student selection or funds allocation
decisions.16 An understanding of NAEP's design
safeguards is crucial in order to comprehend what
NAEP was and was not intended to do and why it is
unique in the American ecology of student
assessment. NAEP has seven distinguishing
characteristics.

NAEP reports group data only, not individual
scores. NAEP results cannot be used to infer how
particular students, teachers, schools, or districts are
achieving or to diagnose individual strengths and
weaknesses. Prevention of these levels of score
reporting was a prerequisite to gaining approval for
the original development and implementation of
NAEP.17

16 Tyler, op. cit., footnote 13.

NAEP is essentially a battery of criterion-referenced
tests in various subject areas (although its
developers prefer the term "objective-referenced,"
since NAEP tests are not tied to any specific
curriculum but measure the educational attainment
of young Americans relative to broadly defined
bodies of knowledge). Unlike many commercially
published NRTs, NAEP scores cannot be used to
rank an individual's performance relative to other
students. This emphasis on criterion-referenced
testing represents an important shift toward outlining
how children are doing on broad educational
goals rather than how they are doing relative to other
students. NAEP is the only test to provide this kind
of information on a national scale.

NAEP has pioneered a survey methodology
known as "matrix sampling." This approach grew
out of item-response theory, and has been hailed as
an important contribution to the philosophy and
practice of student testing.18 Under this method, a
sample of students across the country is tested, rather
than testing all students (which would be considered
a "census" design). Furthermore, the students in the
matrix sample do not take a "whole" test, or even
the same subject area tests, nor are they all given the
same test items. Rather, each student takes a 1-hour
test that includes a mix of easy, medium, and
difficult questions. Thus, NAEP uses a method of
sampling, not only of the students, but also of the
content that appears on the test. Any student taking
a NAEP test only takes one-seventh of the test in a
1-hour testing session. Because of matrix sampling,
a much wider range of content and goals can be
covered by the test than most other tests can allow.
This broad coverage of content is the essential
foundation of a nationally relevant test, as well as a
test that is relatively well protected against the
negative side effects that can occur with teaching to
a narrow test. It is probable that these important
strengths of NAEP, which make it a robust and
nationally credible test, would be difficult to
incorporate into a test designed to be administered to
individuals (unless it were a prohibitively long test).
In addition, because no individual students can be
assigned scores, the matrix sampling approach
imposes an important technological barrier against
the use of NAEP results for making student, school,
district, or State comparisons, or for sorting or
selecting students.
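
The mechanics of matrix sampling described above can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not NAEP's actual procedure: the pool of 70 items, the split into 7 equal blocks (so that each student sees one-seventh of the pool), the simple rotation of blocks across sampled students, the population and sample sizes, and the made-up item difficulties are all hypothetical choices. All function and variable names are invented for this example.

```python
import random

def assign_blocks(num_items, num_blocks, rng):
    """Shuffle item ids and deal them into equal-sized blocks (hypothetical scheme)."""
    items = list(range(num_items))
    rng.shuffle(items)
    return [items[b::num_blocks] for b in range(num_blocks)]

def matrix_sample(population, sample_size, blocks, rng):
    """Sample students, then hand each sampled student one block, in rotation."""
    students = rng.sample(population, sample_size)
    return [(student, blocks[i % len(blocks)]) for i, student in enumerate(students)]

def group_item_rates(assignments, answer):
    """Pool responses by item; the only output is a group-level share correct."""
    right, seen = {}, {}
    for student, block in assignments:
        for item in block:
            seen[item] = seen.get(item, 0) + 1
            right[item] = right.get(item, 0) + (1 if answer(student, item) else 0)
    return {item: right[item] / seen[item] for item in seen}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
NUM_ITEMS, NUM_BLOCKS = 70, 7  # assumed sizes: each student sees one-seventh of the pool

# Made-up "true" chance of a correct answer for each item, from hard to easy.
difficulty = [0.3 + 0.6 * i / (NUM_ITEMS - 1) for i in range(NUM_ITEMS)]

def answer(student, item):
    """Simulated response: correct with the item's assumed probability."""
    return rng.random() < difficulty[item]

blocks = assign_blocks(NUM_ITEMS, NUM_BLOCKS, rng)
assignments = matrix_sample(range(100_000), 7_000, blocks, rng)
rates = group_item_rates(assignments, answer)
```

In this sketch every item in the pool is answered by about one-seventh of the sampled students, so the group-level results span the whole pool even though no student has a score on the whole test. That is the property the text credits with both the broad content coverage and the barrier against individual score reporting.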

NAEP provides comparisons over time, by
testing nationally representative samples of 4th, 8th,
and 12th graders on a biennial cycle. (Prior to 1980,
NAEP tested on an annual cycle.) This form of
sampling deters the kinds of interpretation problems
that can arise when different populations of test
takers are compared.19 Due to cost constraints, the
out-of-school population of students that had been
sampled in early NAEP administrations was
eliminated.

NAEP strives for consensus about educational
goals. NAEP's governing board employs a
consensus-building process for establishing content
frameworks and educational objectives that are broadly
accepted, relevant, and forward looking. Panels of
teachers, professors, parents, community leaders,
and experts in the various disciplines meet in
different locales and work toward agreement on a
common set of objectives for each subject area.
These objectives are then given to item writers, who
come up with the test questions. Before the items are
administered to students, they undergo careful
scrutiny by specialists in measurement and the
subject matter being tested and are closely reviewed
in the effort to eliminate racial, ethnic, gender, and
other biases or insensitivities.20

Recognizing that changing educational objectives 
over time can complicate its mandate to plot trends 
in achievement, NAEP has developed a valuable 
process for updating test instruments. Using this 
process, NAEP revises test instruments to reflect 
new developments in curricular objectives, at the 
same time maintaining links between current and 
past levels of achievement of certain fixed
objectives. In mathematics and reading, for example,
representative samples of students are assessed
using methods that have remained stable over the
past 20 years, while additional samples of students
are tested using instruments that reflect newer
methods or changed definitions of learning
objectives. Thus, the 1990 mathematics assessment
allowed some students to use calculators, a decision
generally praised by the mathematics teaching
community. The NAEP authors took care to note,
however, that the results of these samples were not
commensurate with the mathematics achievement
results from prior years.

17 See, e.g., James Hazlett, University of Kansas, "A History of the National Assessment of Educational Progress, 1963-1973," unpublished doctoral dissertation, December 1973.

18 The principles of matrix sampling are now used in many State assessment programs, as well as in other countries. See chs. 6 and 7 for additional discussion.

19 For example, this was a major problem in using the decline in Scholastic Aptitude Test scores as a basis for the inference that overall achievement had fallen. See Robert Linn and Stephen Dunbar, "The Nation's Report Card Goes Home: Good News and Bad About Trends in Achievement," Phi Delta Kappan, vol. 72, No. 2, October 1990, pp. 127-133.

20 National Assessment of Educational Progress, op. cit., footnote 15.

In addition to information about the Nation as a whole, the
National Assessment of Educational Progress (NAEP)
reports for four regions of the country as well as by sex,
race/ethnicity, and size and type of community. NAEP does
not report results for individual students, but generates
information by sampling techniques.

Although NAEP is predominantly a paper-and-pencil
test relying heavily on multiple-choice items,
certain assessments include open-ended questions
or nontraditional formats. For example: the writing
assessment requires students to produce writing
samples of many different kinds, such as a persuasive
piece or an imaginative piece; the 1990
assessment also included a national "writing
portfolio" of works produced in classrooms; the science
assessment combines multiple-choice questions with
essays and graphs on which students fill in a
response; and the 1990 mathematics assessment
included several questions assessing complex
problem-solving and estimation skills, as recommended by
the mathematics teaching profession.

During its early years, NAEP experimented with
even more varied test formats and technologies,
conducting performance assessments in music and
art that were administered by trained school
personnel and scored by trained teachers and graduate
students. Although many of its more innovative
approaches were suspended due to Federal funding
constraints,21 many State testing programs continue
to use the performance assessment technologies
pioneered by NAEP. Moreover, NAEP continues to
be a pioneer in developing open-ended test items
that can be used for large scale testing; this is
possible largely due to matrix sampling.

Accomplishments 

All of these strengths have lent NAEP a degree of
respect that is exceptional among federally sponsored
evaluation and data collection efforts. NAEP
has produced 20 years of unparalleled data and is
considered an exemplar of careful and innovative
test design. NAEP reports are eagerly awaited before
publication and widely quoted afterward. In addition,
NAEP collects background data about students'
family attributes, school characteristics, and
student attitudes and preferences that can be analyzed
to help understand achievement trends, such as
the relationship between television and reading
achievement.

Because of NAEP, the Nation now knows, among
other trends, that Black students have been narrowing
the achievement gap during the past decade,
9-year-olds in general read better now than they did
10 years ago, able 13-year-olds do less well on
higher order mathematics skills than they did 5 years
ago, and children who do homework read better than
those who do not.

Caveats 

A relatively recent issue has emerged with poten- 
tial consequences for NAEP administration and for 
interpretation of NAEP results. Researchers have 
begun to question whether NAEP scores tend to 
underestimate knowledge and skills of American 
students, precisely because NAEP is perceived as a 
low-stakes test. The question is whether students 
perform at less than their full ability in the absence
of extrinsic motivation to do well. It is not purely an
academic question: much of today's debate over the
future of American education and educational testing
turns on public perceptions of the state of
American schooling, perceptions based at least in
part on NAEP.

21 For discussion of the 1974 funding crisis, see Hazlett, op. cit., footnote 17, pp. 297-299.

Some empirical research on the general question
of motivation and test performance has already
demonstrated that the issue may be more important
than originally believed. For example, one study
found that students who received ". . . special
instructions to do as well as possible for the sake of
themselves, their parents, and their teachers . . ." did
significantly better on the Iowa Tests of Basic Skills
than students in the control group who received
ordinary instructions.22 This result supports the
general findings in research discussed in the preceding
chapter,23 and another analyst's observation that
". . . when a serious incentive is present (high school
graduation), scores are usually higher."24

Prompted by these and other findings, several
researchers are conducting empirical studies to
determine the specific motivational explanations of
performance on NAEP. One study involves
experimental manipulation of instructions to NAEP test
takers; the other involves embedding NAEP items in
an otherwise high-stakes State accountability test.25
Data are to be collected in spring 1992. The results
of these studies will shed light on an important
aspect of how NAEP scores should be interpreted.26

The 1988 Amendments 

The original vision of NAEP has been diminished
by years of budget cuts and financial constraints.
Some of what NAEP once had to offer the Nation has
been lost as a result. Concomitantly, over the past
few years, new pressures have arisen in the attempt
to adapt NAEP to serve purposes for which it was
never intended. Some of this pressure has come from
policymakers frustrated with the lack of effect of
NAEP results in shaping educational policy and the
relatively "low profile" of the test and the results.
Responding in part to this pressure, Congress took
some cautious steps in 1988 to amend NAEP to
provide new types of information.

One dilemma that surfaced during NAEP's first
two decades was that its results did not appear to
have much impact on education policy decisions,
especially at the State and local levels. While
theoretically NAEP could provide benchmarks against
which State and local education authorities could
measure their own progress, many educators argued
that the information was too general to be of much
help when they made decisions about resource
allocations. Others observed that since NAEP carried
no explicit or implicit system for rewards or
sanctions, there was simply no incentive for States
and localities to pay much attention to its results.

Had NAEP not been so highly respected, criticisms
about its negligible influence on policy might
have been considered minor, but given NAEP's
reputation, its lack of clout was viewed as a major
lost opportunity. Pressure mounted to change NAEP
to make State and local education authorities take
greater heed of its message. These voices for change
were quickly met by experts who reissued warnings
from the past: that any attempts to use NAEP for
purposes other than analyzing aggregate national
trends would compromise the value of its information
and ultimately the integrity of the entire NAEP
program.27 The principal concerns were:

1. that turning NAEP into a high-stakes test
would lead to the kinds of score "inflation" or
"pollution" that have undermined the credibility
of other standardized tests as indicators
of achievement (see ch. 2); and

2. that using NAEP to compare student attainment
across States would induce States to
change their curricula or instruction for the
sake of showing up better on the next test,
rather than as a result of careful deliberations
over what should be taught to which students
and under what teaching methods.

22 Steven M. Brown and Herbert J. Walberg, University of Illinois at Chicago, "Motivational Effects on Test Scores of Elementary School Students: An Experimental Study," monograph, 1991.

23 See Daniel Koretz, Robert Linn, Stephen Dunbar, and Lorrie Shepard, "The Effects of High Stakes Testing on Achievement: Preliminary Findings About Generalization Across Tests," paper presented at the annual meeting of the American Educational Research Association, 1991.

24 See Paul Burke, "You Can Lead Adolescents to a Test But You Can't Make Them Try," OTA contractor report, Aug. 14, 1991, p. 4.

25 Robert Linn, University of Colorado at Boulder, personal communication, November 1991.

26 For discussion of general issues regarding the public's understanding of National Assessment of Educational Progress scores, see Robert Forsyth, "Do NAEP Scales Yield Valid Criterion-Referenced Interpretations?" Educational Measurement: Issues and Practice, vol. 10, No. 3, fall 1991, pp. 3-9; and Burke, op. cit., footnote 24.

27 The strongest early warnings about NAEP were found in Harold Hand, "National Assessment Viewed as the Camel's Nose," Phi Delta Kappan, vol. 47, No. 1, 1965, pp. 8-12; and Harold Hand, "Recipe for Control by the Few," Educational Forum, vol. 30, No. 3, 1966, pp. 263-272.

When NAEP came up for congressional reauthorization
in 1988, it was amid a climate of growing public
demands for accountability at all levels of education
(fueled in part, ironically, by NAEP's own reports of
mediocre student achievement in critical subjects).
Almost a decade of serious education reform efforts
had made little visible impact on American students'
test scores, especially relative to those of
international competitors.

Trial State Assessment 

Congress responded by authorizing, for the first
time, State-level assessments, to be conducted on a
voluntary, trial basis. Beginning with the 1990
eighth grade mathematics assessment and the 1992
fourth grade mathematics and reading assessments,
NAEP results were to be published on a
State-by-State basis for those States that chose to participate.
Congress considered this amendment a trial, to be
followed up with careful evaluation, before the
establishment of a full-scale, State-level NAEP
program could be considered.

While proponents believed that the experiment
would yield useful information for SEAs, critics
worried that a State-by-State assessment would
invite fruitless comparisons among States that did
not take into account other factors influencing
achievement; would put pressure on States to teach
to the test or find other ways to artificially inflate
scores; or would lead to general "education
bashing." Most importantly, critics cautioned that with
the State assessment Congress would eventually
succumb to pressure to allow assessments and
comparative reporting by district, by school, or even
by student, a travesty of NAEP's original purpose
and design.

Thirty-seven States, the District of Columbia,
Guam, and the Virgin Islands participated in the first
trial State assessment of mathematics, conducted in
1990. Results were released in June 1991.28 As
expected, some media reports focused on the
inevitable question of: "Where does your State rank?" In
general, however, the consequences of the trial will
not be apparent for some time. In addition to
analyzing the effects of the trial on the quality and
validity of NAEP data and on State and local policy
decisions, observers are likely to focus on whether
the information will be worth the high cost of
administering the State assessments, and whether
the cost of the State programs will crowd out other
necessary expenditures or improvements in the basic
NAEP program.

Standard Setting 

The 1988 reauthorization made another fundamental
revision in the original concept of NAEP.
From its inception, NAEP had reported results in
terms of proficiency scales, pegged to everyday
descriptions of what children at that performance
level could do. For example, a 200 score in reading
meant that students ". . . have learned basic
comprehension skills and strategies and can locate and
identify facts from simple informational paragraphs,
stories, and news articles."29 NAEP has been
commended for its accuracy in describing how
things are. In the late 1980s, however, it came under
criticism because it was silent on how things ought
to be. Those who saw NAEP as a potential tool for
reforming schools or measuring progress toward the
President's and the Governors' National Goals for
the year 2000 thought that NAEP should set
proficiency standards, that is, benchmarks of what
students should be able to do. As with the statewide
assessment proposal, the recommendation for
proficiency standards raised the hackles of many
educators, researchers, and policymakers. Opponents of
the proposal said it would undermine local control of
education; increase student labeling, tracking, and
sorting; and compromise NAEP's original purpose
and validity.

The 1988 amendments created a new governing
body, the National Assessment Governing Board
(NAGB), and charged it with identifying ". . .
appropriate achievement goals for each age and grade
in each subject area." NAGB has completed the
standard-setting process for mathematics in 4th, 8th,
and 12th grades, and in doing so, generated
considerable controversy. Many observers felt that the
mathematics standards were hammered out too
quickly, before true consensus was achieved.

28 See Ina V.S. Mullis, John A. Dossey, Eugene H. Owen, and Gary W. Phillips, Educational Testing Service, The State of Mathematics Achievement, prepared for the National Center for Education Statistics (Washington, DC: U.S. Department of Education, Education Information Branch, June 1991).

29 For analysis of National Assessment of Educational Progress' definitions of literacy see John B. Carroll, "The National Assessments in Reading: Are We Misreading the Findings?" Phi Delta Kappan, vol. 68, No. 6, February 1987, pp. 424-430.

Adding the trial State assessment and standard-setting
activity increased NAEP funding from about
$9.3 million in fiscal year 1989 to over $17 million
in fiscal year 1990 (nominal dollars).

NAEP in Transition 

When authorization for the trial State assessments
and standard-setting processes expires, Congress
will face the issue of whether to continue and expand
these efforts. As of now, Congress has authorized
planning for the 1994 trial, but has not appropriated
funds for the implementation of the trial itself. The
Administration's "America 2000 Excellence in
Education Act" recommends authorization of
State-by-State comparisons in five core subject areas
(mathematics, science, English, history, and geography)
beginning in 1994 as a means of monitoring
(and stimulating) progress toward the National
Goals. The Administration's bill also suggests that
tests used in NAEP be made available to States that
wish to use them for testing at school or district
levels at their own expense.

In conclusion, the basic question facing Congress 
is whether to make NAEP even more effective at 
what it was originally intended to do, or to explore 
ways that NAEP could serve new purposes. OTA 
finds that any major changes in NAEP should be 
carefully evaluated with respect to potential effects 
on NAEP's capacity to serve its original purpose. 

National Testing 

Overview 

Perhaps the proposals with the most far-reaching
implications for the Federal role in testing are those
calling for the creation and implementation of a
national testing program. Although the objectives of
the various national testing proposals are somewhat
unclear, they appear to rest on two basic assumptions:
first, that the skills and knowledge of most
American schoolchildren do not meet the needs of a
changing global economy; and second, that new
tests can create incentives for the teaching and
learning of the appropriate knowledge and skills.
Momentum for these efforts has built rapidly, fueled
by numerous governmental and commission reports
on the state of the economy and of the educational
system; by the National Goals initiative of the
President and Governors; by casual references to the
superiority of examination systems in other countries;
and most recently by the President's "America
2000" plan.

The National Assessment of Educational Progress has
developed and pilot tested a variety of hands-on science
and mathematics tasks. In this example, students watch
an administrator's demonstration of centrifugal force
and then respond to written questions about what
occurred in the demonstration.

Taken together, the questions of purpose and
balance between local control and national interest
frame the debate regarding the desirability of
national testing. This debate must reflect both the
needs of the Nation and the well being of individual
students.

Congress provides the best forum for review of
this question. Commitment to such a test represents 
a major change in education policy and should not 
be undertaken lightly. A number of issues must be 
considered in weighing the concept. 

Will testing create incentives that motivate
students to work harder? What are the effects of
tests on the motivation of students? Tests should
reward classroom effort, rather than undermine it.
Tests built on comparing students to one another, for
example, may reinforce the notion that effort does
not matter, since the bell curve design of norm
referencing always places some students at the top,
some at the bottom, and most in the middle.
Furthermore, if the test is of no consequence to the
students, they may not be motivated to try hard or to
study to prepare for it. The motivation of those who
do poorly on tests must be carefully considered.
Those students who repeatedly experience failure on
tests (starting in the earliest years of schooling),
without any assistance or guidance to help them
master test content, are unlikely to be motivated by
a high-stakes test. Positive motivational effects are
likely only if students perceive they have a good
chance of achieving the rewards attached to strong
test performance.

How broad will the content and skills covered 
be? Can just one test be offered to all students at a 
particular grade level, or will there need to be a range 
of tests at various levels and disciplines? This affects 
the testing burden on any one student and the range
of levels at which testing can be focused. In some 
European countries, for example, students take 
subject-specific examinations at a choice of levels. 
Some examinations take many hours or are adminis- 
tered over several days, with combinations of testing 
items and formats that call on a range of performance 
by the student. 

Would the test be voluntary or mandatory? 
Voluntary tests sound appealing. However, if a test 
becomes very widely used or needed for access to 
important resources, it will no longer be truly 
voluntary. Choosing not to take a test may not be a
neutral option; negative consequences may result for
those who choose not to be tested. This is especially
true if a test is used for selection or credentialing;
without a test result in hand, what chance does the
student have? Furthermore, voluntary tests do not
provide an accurate picture if the goal is school
accountability. If only those students, schools,
districts, or States that feel they can do well on a test
participate in it, the results give an inaccurate picture 
of achievement. The claim that an important test can 
be voluntary should be taken with a grain of salt. 

What happens to those who fail? Are there 
resources provided to help them? If consequences 
for failure are high and a student has no recourse 
once the examination has been taken, the wisest 
choice for a student who is having difficulty in 
school is to skip the examination altogether. The 
negative effects of examinations on students who do 
not do well have been a matter of serious concern in 
many European countries. Some countries have
been dismayed to find that some students leave
school before required high-stakes examinations are
offered, rather than face the indignity and stigma that
accompanies failure. This has also occurred with
high school graduation examinations in some parts
of this country. Rather than punishing those who do
not succeed at standards that seem unattainable, tests
can be designed to make standards more explicit and
the path to their acquisition more clear. If
it is certain that low scores do not mean failure but
that additional or refocused resources will be
provided to the students, testing can have positive
outcomes.

Who will design the tests and set performance 
standards? In the decentralized U.S. educational 
system, national testing proposals raise questions of 
State and local responsibility for determining what 
is taught and how it is taught. Can any test content 
be valid for the entire Nation? Who shall be charged
with determining test content? It is important to
recall that achievement tests by definition must
assess material taught in the classroom. As the
content of a test edges away from the specifics of
what is delivered in classrooms, based on
State-defined curricular goals, and searches instead for
common elements, it can become either a test of
"basic skills" or of more general skills and
understandings. In the latter case, however, the test risks
becoming more a measure of aptitude than one of
achievement. (See also ch. 6, box 6-A.) Similarly,
setting performance standards on a national basis 
assumes the feasibility of consensus not only on 
what is taught and measured, but also on what 
constitutes acceptable performance, and on proce- 
dures to distinguish among levels of performance. 

Will the content and grading standards be 
visible or invisible? Will the examinations be 
secret or disclosed? Experience from the classroom 
and other countries suggests that students are more 
motivated and will learn better when they
understand what is expected of them and when they know
what competent performance looks like. It is
important to note that in Europe the impact of
examinations on teaching and learning (what is taught and
learned and how it is taught and learned) is
mediated through the availability of past
examination papers. The tradition in this country is just the
opposite. Most high-stakes examinations are kept 
secret, in part because of high development costs. 
For a national examination to have salutary effects
on learning, the additional costs of item disclosure
should be weighed against the larger impact of the
examination on teaching and learning.

Would the examination be administered at a 
single sitting or several times, perhaps when
students feel ready? This question affects students' 
control over the opportunity to study and prepare for 
an examination. If students can schedule a test when 
they feel they have mastered the material, they are 
more likely to be motivated by a realistic expectation 
of success. Conversely, accountability examinations 
are more likely to require single-sitting
administration if they measure achievement within a common
timeframe. 

Do students have a chance to retake an exami- 
nation to do better? Allowing retakes suggests a 
mastery model in which effort is rewarded and 
students can try again if they do not master the 
material the first time. It reinforces the idea that 
students can learn what they need to know.

Would the tests be administered to samples of 
students or individuals? If a test is intended to 
increase student motivation, then it will have to be
an individual test. However, tests administered to
individuals need safeguards to meet high technical
standards if they will affect the future opportunities
of individuals. 

At what age are students to be tested? American 
elementary schoolchildren are tested far more often 
than their European counterparts, especially with 
standardized examinations. Much of the rationale 
for this testing is related to the selection of children 
for Chapter 1 services and for identification of
progress within those programs. This testing has had 
a spill-over effect greatly influencing overall ele- 
mentary school testing practice. However, the use of 
multiple-choice, standardized norm-referenced test- 
ing of elementary school children in general, and 
young (prior to grade three) children in particular, is 
under attack by those who see the negative
consequences of early labeling. Thus, the suggestion of a
new national examination at this age stands in 
contrast with efforts in many States to reduce early 
childhood standardized testing and to use instead 
teacher assessments, checklists, portfolios, and other
forms of performance-based assessments. 

What legal challenges might be raised? Legal 
challenges based on fairness have become a part of 
the American landscape. Public policy in this
country is based on assurances of equal protection
under the law; furthermore, cultural and racial
diversity make equity issues far more significant in 
this country than in most others. Tests must meet
these challenges by careful design that assures that 
the administration and scoring procedures are fair, 
the content measures what all participants have been 
taught, and the scores are used for the purposes 
understood and agreed to by the participants.

What test formats will be used? Tests send
important signals to students about the kinds of
skills and knowledge they need to learn. Tests that
rely on a single format, such as multiple choice, are 
likely to send a limited message about necessary 
skills. As noted earlier, the United States and Japan 
are the only countries to rely almost exclusively on
multiple-choice paper-and-pencil examinations for 
testing. Current proposals for national tests range 
from the use of multiple-choice norm-referenced
standardized tests to the use of "state-of-the-art"
assessment practices. Test format and procedures for
scoring go hand in hand. Because performance 
assessments generally involve scoring by teachers or 
other experts, they are more expensive than machine- 
scorable tests. A diversity of formats in tasks and
items may be the best means of balancing tradeoffs 
between the kinds of skills and understandings that 
any one test can measure and the costs of testing. 

Conclusions 

The answers to these questions will shed light on 
the larger questions of whether or not national 
testing is desirable. Goals must be clearly set to 
determine the kind of tests, content, costs, and 
potential linkages to curriculum. For example, if
Congress sets as its goal increasing student effort for 
higher achievement by testing in specific subjects, 
one would expect mandatory tests, administered to 
all individuals, with the content made explicit 
through a common syllabus covering a broad scope
of material, with past test items made public so
students can study and practice for them. If other 
countries are to be a guide, this kind of examination 
is not used for testing children under the age of 16 or 
even 18. Some States are already using tests of this 
sort (e.g., New York Regents, California Golden
State Examinations) for students as high school- 
leaving examinations. Congress should consider 
how the participation of these States would be 
affected, or how these tests could serve as models for 
use, or be calibrated to match some national
standard. 

Furthermore, if the goal is to encourage
performance that includes direct measures of complex tasks,
then written essays, portfolios of work over time, or
oral presentations may be called for. These tests
would be considerably more costly to develop,
administer, and score than machine-scored norm-
referenced examinations. Tests of this type are not as
carefully researched and may be challenged if used
prematurely for high-stakes outcomes like selection
or certification. 

At present, there is controversy over the use of 
many test results. The development and use of tests 
is complicated, both in terms of science and politics. 
If a test is placed into service at the national level 
before these important questions are answered,
OTA finds that the test could easily become a
barrier to many of the educational reforms that
have been set into motion and become the next 
object of concern and frustration within the 
American educational system. 

Congress should consider the questions of test 
desirability and use first, and then consider policy
directions that emerge from these conclusions. This 
deliberation cannot be separated from a comprehen- 
sive look at the other issues discussed in this section, 
specifically, the role of NAEP in the national testing
mosaic, the ways testing is used for Chapter 1
purposes, and how students' interests are to be
protected. The policy implications of these choices
are considered collectively in chapter 1.

Lessons From the Past: A History of
Educational Testing in the United States

Highlights

• Since their earliest administration in the mid-19th century, standardized tests have been used to assess
student learning, hold schools accountable for results, and allocate educational opportunities to
students.

• Throughout the history of educational testing, advances in test design and innovations in scanning and
scoring technologies helped make group-administered testing of masses of students more efficient and
reliable.

• High-stakes testing is not a new phenomenon. From the outset, standardized tests were used as an
instrument of school reform and as a prod for student learning.

• Formal written testing began to replace oral examinations at roughly the same time as schools
changed their mission from servicing the elites to educating the masses. Since then tests have remained
a symbol of the American commitment to mass education, both for their perceived objectivity and for
their undeniable efficiency.

• Although standardized tests were seen by some as instruments of fairness and scientific rigor applied
to education, they were soon put to uses that exceeded the technical limits of their design. A review
of the history of achievement testing reveals that the rationales for standardized tests and the
controversies surrounding test use are as old as testing itself.



The burgeoning use of tests during the past two
decades (to measure student progress, hold
students and their schools accountable, and more
generally solidify various efforts to improve
schooling) has signified to some observers a ". . . profound
change in the nature and use of testing."^1

But the use of tests for the dual purposes of 
measuring and influencing student achievement is 
not a historical anomaly. The three principal
rationales for student testing (classroom feedback;
system monitoring; and selection, placement, and
certification) have their roots in practices that
began in the United States more than 150 years ago.
And many of the points that frame the testing debate
today, such as the potential for test misuse, echo
arguments that have been sounded since the begin- 
ning of standardized student testing. 

This chapter surveys the evolution of student 
testing in American schools, and develops four 
themes: 



1. Tests in the United States have always been
used to ascertain the effects of schooling on
children, as well as to manage school systems
and influence curriculum and pedagogy. Tests
designed and administered from beyond
classrooms have always been more useful to
administrators, legislators, and other school
authorities than to classroom teachers or
students, and have often been most eagerly
applied by those seeking school reform. 

2. The historical use of standardized tests in the 
United States reflects two fundamentally Amer- 
ican beliefs about the organization and alloca- 
tion of educational opportunities: fairness and 
efficiency. The fairness principle involves, for 
example, assurances to parents that their chil- 
dren are offered opportunities similar to those 
given children in other schools or
neighborhoods. Efficiency refers to the orderly
provision of educational services to all children.
These have been the foundation blocks for the 



^1 George Madaus, quoted in Edward B. Fiske, "America's Test Mania," The New York Times, Apr. 10, 1988, section 12, p. 18. See ch. 3 of this report
for a detailed account of the rise of testing in the 1970s and 1980s.

American system of mass public schooling;
testing has been a key ingredient of the mortar. 

3. Increased testing has engendered tension and 
controversy over its effects. These tensions 
reflect the centrality of schooling in American 
life, and competing visions of the purposes and
methods of education within American plural- 
ism. Demand for tests stems in large part from 
demand for fair treatment of all students; the 
use of tests, however, especially for sorting 
and credentialing of young persons, has
always raised its own questions of fairness.

4. As long as schooling continues to play a 
central role in American life, and as long as 
tests are used to assess the quality of education, 
testing will occupy a prominent place on the 
public policy agenda. The search for better
assessment technologies will continue to be 
fraught with controversies that have as much to 
do with testing per se as with conflicting 
visions of American ideals and values. 

This chapter focuses on testing through four 
chronological periods. The first section begins with
the initial educational uses of standardized written
examinations in the mid-19th century and continues
through the development of mental (intelligence)
measurement near the end of that century. The next 
section covers the onset of intelligence and achieve- 
ment testing in the schools, a movement spurred 
largely by managerial and administrative concerns 
and supplied, in large part, with the newly
developing tools of "scientific" testing. The third section
focuses on trends in educational testing from the end
of World War I through the end of World War II, a
period marked by important technological advances 
as well as refinements in the art and science of 
testing. The last section of this chapter is a
discussion of the pivotal role of testing in the struggle for
racial equality, increased educational access, and
international technological competitiveness in the
years after World War II.



Achievement Tests Come to 
American Schools: 1840 to 1875 

Overview 

The period from 1840 to 1875 established several 
main currents in the history of American educational 
testing. First, formal written testing began to replace 
oral examinations administered by teachers and 
schools at roughly the same time as schools changed 
their mission from servicing the elite to educating 
the masses. Second, although the early standardized 
examinations were not designed to make valid 
comparisons among children and their schools, they 
were quickly used for that purpose. Motivated in part 
by a deep commitment to fairness in educational
opportunities, the use of tests soon became
controversial precisely over challenges to their fairness as
a basis for certain types of comparisons, challenges
leveled by some teachers and school leaders,
although not by the most active crusaders on behalf of
free and universal education. Third, the early written
examinations focused on the basics (the major
school subjects) even though the objectives of
schooling were understood to be considerably broader
than these topics. Finally, from their inception
standardized tests were perceived as instruments of
reform:^2 it was taken as an article of faith that
test-based information could inject the needed 
adrenalin into a rapidly bureaucratizing school 
system. 

Demography, Geography, and Bureaucracy 

Tests of achievement have always been part of the
experience of American school children. In the
colonial period, school supervisors administered 
oral examinations to verify that children were 
learning the prescribed material. Later, as school 
systems grew in size and complexity, the design, 
purposes, and administration of achievement testing 
evolved in an effort to meet new demands. Well
before the Civil War, schools used externally 
mandated written examinations to assess student 
progress in specific curricular areas and to aid in a



^2 "Reform" means different things to different people, especially with respect to education. In this report the word is intended neutrally, i.e., as
"change," although it clearly connotes the intention to improve, upgrade, or widen children's educational experiences. The possibility that good
intentions can lead to unintended consequences is the central theme in such works as Michael B. Katz, The Irony of Early School Reform (Cambridge,
MA: Harvard University Press, 1968). See also Lawrence Cremin, The Transformation of the School: Progressivism in American Education, 1876-1957
(New York, NY: Vintage Books, 1964) for an even broader exploration of change, i.e., as "transformation" of the school.

variety of administrative and policy decisions.^3 As
early as 1838 American educators began articulating 
ideas that would soon be translated into the formal 
assessment of student achievement. 

What were the main factors that led to this interest 
in testing? What were the main purposes for testing? 
Some of the answers lie in the demography and 
political philosophy that shaped the 19th century 
American experience. 

Between 1820 and 1860 American cities grew at 
a faster rate than in any other period in U.S. history, 
as the number of cities with a population of over 
5,000 increased from 23 to 145.^4 That same period
saw an average annual immigration of roughly
125,000 newcomers, mostly Europeans (see figure
4-1).^5 Coincident with this immigration and
urbanization, the idea of universal schooling took hold. By
1860 ". . . a majority of the States had established
public [primary] school systems, and a good half of
the nation's children were already getting some
formal education."^6 Some States, like
Massachusetts, New York, and Pennsylvania, were moving
toward free secondary school as well. 

Although it is difficult to establish a causal link
between these demographic and educational changes, 
surely one thing that attracted European immigrants 
was the ideal of opportunity embodied in the 
American approach to universal schooling. Follow- 
ing his visit to the United States in 1831 to 1832, the 
Frenchman Alexis de Tocqueville shared with his
countrymen his conviction that there was no other
country in the world where ". . . in proportion to the
population there are so few ignorant and at the same
time so few learned individuals. Primary instruction
is within the reach of everybody; superior
instruction is scarcely obtained by any."^7



Figure 4-1—Annual Immigration to the United States:
1820-60

SOURCE: Office of Technology Assessment, based on data from U.S.
Department of Commerce, Bureau of the Census, Historical
Statistics of the United States, Colonial Times to 1970
(Washington, DC: 1975), pp. 105-111.

At the same time, it could be argued that 
population growth and increased heterogeneity
necessitated the crafting of institutions, such as
universal schooling, to "Americanize" the masses.
The 20th century social philosopher Hannah Arendt
wrote, for example, that education has played a
". . . different, and politically incomparably more
important, role [in America] than in other
countries," in large part because of the need to
Americanize the immigrants.^8

The concept of Americanization extended well 
beyond the influx of immigrants who arrived in the
latter half of the 19th century, however. The 



^3 Many historians of American educational testing focus on the influence of the intelligence testing movement, which began at the end of the 19th
century. See, e.g., Daniel Resnick, "The History of Educational Testing," Ability Testing: Uses, Consequences, and Controversies, part 2, Alexandra
Wigdor and W. Garner (eds.) (Washington, DC: National Academy Press, 1982), pp. 173-194; or Walter Haney, "Testing Reasoning and Reasoning
About Testing," Review of Educational Research, vol. 54, No. 4, winter 1984, pp. 597-654.

^4 David Tyack, The One Best System: A History of American Urban Education (Cambridge, MA: Harvard University Press, 1974), p. 30.

^5 U.S. Department of Commerce, Bureau of the Census, Historical Statistics of the United States: Colonial Times to 1970, part 1 (Washington, DC:
U.S. Government Printing Office, 1975), p. 106.

^6 Cremin, op. cit., footnote 2, p. 13. This chapter relies heavily on Cremin's work, but also on important educational historiography of David Tyack,
Michael Katz, Ira Katznelson, Margaret Weir, and Carl Kaestle.

^7 See Alexis de Tocqueville, Democracy in America, vol. 1 (New York, NY: Vintage Books, July 1990), p. 52.

^8 Hannah Arendt, "The Crisis in Education," Partisan Review, vol. 25, No. 4, fall 1958, pp. 494-495. See also Diane Ravitch, The Great School Wars:
New York City, 1805-1973 (New York, NY: Basic Books, 1974), p. 171, for her treatment of some of the early American educators (like William Henry
Maxwell in New York) who saw schooling as the ". . . antidote to problems that were social, economic, and political in nature."






foundation for a political role for education had 
already been laid in the colonial and post- 
Revolutionary periods, as religious, educational, and 
civic leaders began considering the possible rela- 
tionships between lack of schooling, ignorance, and 
moral delinquency. These leaders, especially in the 
burgeoning cities, advocated public schooling for 
poor children who lacked access to church-run 
charity schools or to common pay schools (schools 
available to all children in an area but for which 
parents paid part of the instructional costs). 

Up until the mid-19th century, the pattern of
education consisted of private schools run by paid 
tutors, State-chartered academies and colleges with
more formal programs of instruction, benevolent
societies, and church-run charity schools: in sum, a
"hodge-podge" reflecting the many:

. . . motives that impelled Americans to found 
schools: the desire to spread the faith, to retain the 
faithful, to maintain ethnic boundaries, to protect a 
privileged class position, to succor the helpless, to 
boost the community or sell town lots, to train
workers or craftsmen, to enhance the virtue or 
marriageability of daughters, to make money, even 
to share the joys of learning.^9

Population growth and density created new 
strains on schools' capacity to provide mass
education.^10 According to census statistics, public school
enrollments grew from 6.8 million in 1870 to 15.
million by 1900. By the turn of the century, almost
80 percent of children aged 5 to 17 were enrolled in
some kind of school.^11 Mass public education could
no longer be viable without fundamental
institutional adaptations. Expanding enrollment also
placed new strains on the public till as public schools
began overshadowing private and charity schools. In
direct expenditures, the percentage of total
education spending attributable to the public schools grew
from less than one-half in 1850 to more than 80
percent in 1900.^12 In terms of foregone income as
well, the costs were impressive: the income that
students aged 10 to 15 would have earned were they
not in school increased from an estimated nearly $25
million in 1860 to almost $215 million in 1900.^13
Not surprisingly, this spending inevitably led to calls 
for evidence that the money was being used wisely. 

The size and concentration of the growing student 
population increased the taxpayers' burden and
created new institutional demands for efficiency 
similar to those that governed the evolving nature of 
many American institutions. One way schools could 
demonstrate sound fiscal practice was by organizing 
themselves according to principles of bureaucratic 
management. "Crucial to educational bureaucracy
was the objective and efficient classification, or
grading, of pupils."^14 According to Henry Barnard,
a prominent figure in the common school move- 
ment, it was not only inefficient, but also inhumane, 
to fill a classroom with children of widely varying 
ages and attainment.^15 On this assumption, the
mid-19th century reformers sought additional
information that would make the classification more
rational and efficient than the prevailing system of
classification, based primarily on age. They turned
their attention toward achievement tests. 

The result was one of many ironies in the history 
of educational testing: the classification and group- 
ing of students, essentially a Prussian idea, became 
a pillar in the public school movement that was an 
American creation. No less an American educational 
statesman than Horace Mann, who saw universal



^9 David Tyack and Elisabeth Hansot, Managers of Virtue: Public School Leadership in America, 1820-1980 (New York, NY: Basic Books, 1982),
p. 30. See also Katz, op. cit., footnote 2, p. 131. Katz writes that: ". . . the duty of the school was to supply that inner set of restraints upon passion, that
bloodless adherence to a personal sense of right, which would counteract and so reform the dominant tone of society."

^10 For a more detailed analysis of the shifts from rural to urban education, see, e.g., Tyack, op. cit., footnote 4. Also, see Michael B. Katz, Class,
Bureaucracy, and Schools (New York, NY: Praeger, 1972).

^11 Bureau of the Census, op. cit., footnote 5, p. 369. See also Tyack, op. cit., footnote 4, p. 66, who cites a report by W.T. Harris with similar data.

^12 Tyack and Hansot, op. cit., footnote 9, p. 30.

^13 Tyack, op. cit., footnote 4, pp. 66-67.

^14 Ibid., p. 44, emphasis added. It is worth recalling that the early exponents of bureaucracy spoke of its formalism (manifest in classification systems
of the type discussed here) in positive terms, i.e., as an improvement over earlier forms of organization that were at once less fair and less efficient.
See, e.g., Max Weber, The Theory of Social and Economic Organization, edited and translated by A.M. Henderson and T. Parsons (New York, NY:
MacMillan Publishing Co., 1947). The appeal of tests as both fair and efficient tools of management is a main theme in this chapter.

^15 Tyack, op. cit., footnote 4, p. 44, emphasis added. Barnard's lifelong commitment to school improvement for the masses, coupled with his belief
in the importance of conserving the social and economic status of the privileged classes, personifies an important aspect of the American experiment
in education. See also Merle Curti, The Social Ideas of American Educators (Paterson, NJ: Pageant Books, Inc., 1959), pp. 139-168.

education as the "great equalizer" and who had a
". . . total faith in the power of education to shape
the destiny of the young republic,"^16 supported the
highly structured model of schools in which students
would be sorted according to their tested
proficiency.^17 Thus, as early as the mid-19th century,
there existed a belief in the role of testing as a vehicle 
to classify students ex ante, commonly viewed as a 
necessary step in providing education. Also emerg- 
ing during this period was an interest in uses of tests 
ex post: to monitor the effectiveness of schools in 
accomplishing their purposes. Visionaries like Mann 
saw testing as a means to educate effectively; 
administrators, legislators, and the general public 
turned to tests to see what children were actually
learning. 




Photo credits: Frances B. Johnston

Teachers have always assessed student performance directly.
These photos were taken circa 1899 for a survey of Washington,
DC schools.



In fact, it was during Horace Mann's tenure as 
Secretary of the (State) Board of Education that 
Massachusetts became the site of ". . . the first
reported use of a written examination . . . after some
harassment by the State Superintendent of
Instruction about the shortcomings of the schools. . . ."^18
From its inception, this formal written testing had
two purposes: to classify children (in pursuit of more
efficient learning)^19 and to monitor school systems
by external authorities. Under Mann's guidance, the 
State of Massachusetts moved from subjective oral 
examinations to more standardized and objective 
written ones, largely for reasons of efficiency. 
Written tests were easier to administer and offered a 
streamlined means of classifying growing numbers 
of students. 



^16 Cremin, op. cit., footnote 2, pp. 8-9.

^17 Katz, op. cit., footnote 2, pp. 139-140.

^18 Resnick, op. cit., footnote 3, p. 179, emphasis added.

^19 Tyack, op. cit., footnote 4, p. 45. Tyack notes that classification preceded standard examinations: ". . . the proper classification was only the
beginning. In order to make the one best system work, the schoolmen also had to design a uniform course of study and standard examinations." But
he does not describe the criteria for classification used prior to the standard examinations, which would be important to analyze the comparative fairness
of formal and informal classification systems. It appears, though, always to have involved some type of proficiency testing, the difference being between
looser and more subjective classroom-based tests and the more formal externally administered tests.

It is important to point out what
"standardization" meant in those days. It did not mean
"norm-referenced" but rather that ". . . the tests were
published, that directions were given for
administration, that the exam could be answered in consistent
and easily graded ways, and that there would be
instructions on the interpretation of results."^20 The
model was quite consistent with the assumed virtues 
of bureaucratic management. The efficient flow of 
information was not unique to education or educa- 
tional testing; it was becoming a ubiquitous feature 
of American society.^21

Perhaps more important, though, was the evolv- 
ing role of testing as a vehicle to ensure fairness and 
evenhandedness in the distribution of educational 
resources: one way to ascertain whether children in 
the one-room rural schoolhouse were receiving the 
same quality of education as their counterparts in 
the big cities was to evaluate their learning through 
the same examinations. Thus, standardized testing 
came to serve an important symbolic function in 
American schools, a sort of technological embodi- 
ment of principles of fairness and universal access 
that have always distinguished American schools 
from their European and Asian counterparts. As the 
methods of testing later became increasingly quanti- 
tative and "scientific" in appearance, the tests 
gained from the growing public faith in the ability of 
science and rational decisionmaking to better man- 
kind. 

But Mann had other reasons for introducing 
standardized testing. He had been engaged in an 
ideological battle with the Boston headmasters, who 
perceived him as a "radical." This disagreement
reflected a wider schism in the Nation between 
reformers like Mann who believed in stimulating 
student interest in learning through greater emphasis 
on the "real world," and hard-liners who believed
in discipline, rote recitation, and adherence to
texts.^22 Although Mann and his compatriots
eventually won, setting American public education on a
unique historical course, one of their more potent
weapons in the battle was one that might today be
associated with a hard-line, top-down approach to
school reform: when two of Mann's allies were
appointed to examine the status of the grammar
schools, ". . . they gave written examinations with
questions previously unknown to the teachers [and]
. . . published a scathing indictment of the Boston
grammar schools in their annual report."^23

The Logic of Testing

The fact that the first formal written examinations 
in the United States were intended as devices for 
sorting and classifying but were used also to monitor 
school effectiveness suggests how far back in 
American history one can go for evidence of test 
misuse. The ways in which these tests were used for 
monitoring was logical: to find out how students and 
their schools are performing, it made sense to 
conduct some sort of external measurement process. 
But the motivation for the standardized
examinations in Massachusetts was, in fact, more
complicated and reveals a pattern that would become
increasingly familiar. The idea underlying the imple- 
mentation of written examinations, that they could 
provide information about student learning, was 
bom in the minds of individuals already convinced 
that education was substandard in quality. This 
sequence — perception of failure followed by the 
collection of data designed to document failure (or 
success) — offers early evidence of what has become 
a tradition of school refomi and a truism of student 
testing: tests are often administered not just to 
discover how well schools or Idds are doing, but 
rather to obtain external confirmation — ^validation — 
of the hypothesis that they are not doing well at all.^ 



20Resnick, op. cit., footnote 3, p. 179.

21George Madaus, for example, writes that the movement toward standardization and conformity began in 1815 with efforts in the Army Ordnance
Department to develop ". . . administrative, communication, inspection, accounting, bureaucratic and mechanical techniques that fostered conformity
and resulted in the technology of interchangeable parts . . . [and that] these techniques . . . were well known throughout the textile mills and machine
shops of New England when Horace Mann introduced the standardized written test. . . ." George Madaus, "Testing as a Social Technology,"
unpublished monograph, Inaugural Annual Boisi Lecture on Education and Public Policy, Boston College, Dec. 6, 1990, pp. 26-27. See also Katz, op.
cit., footnote 2, pp. 5-11, for an account of the dramatic changes in the structure and management of American business during Mann's lifetime.

22See Katz, op. cit., footnote 10, pp. 115-153, for a fuller discussion of the origins and ramifications of this ideological struggle.

23Ibid., p. 152. See also Madaus, op. cit., footnote 21.

24Although testing was not yet considered a scientific enterprise (that would come later in the century, with the emergence of psychology and the
concepts of mental measurement — see below), the logic of its application had traces of the inductive model: from empirical observations of the schools,
to hypotheses explaining those observations, to the more systematic and less anecdotal collection of data in order to test the hypotheses. For a physicist's
views on the basic fallacies in mental "measurement," however, see David Layzer, "Science or Superstition? A Physical Scientist Looks at the IQ
Controversy," The IQ Controversy: Critical Readings, N.J. Block and Gerald Dworkin (eds.) (New York, NY: Pantheon Books, 1976), pp. 194-241.
The use of formal, written achievement tests in
Massachusetts (and soon afterwards in many other
places), as already emphasized, was motivated
largely by administrative concerns.25 The tests
themselves often focused on a rather narrow set of
outcomes, selected principally to put the headmas-
ters in the worst possible light. There was a profound
mismatch between the content covered in those early
achievement tests and the objectives of common
schooling those tests were intended to gauge. Given
the schools' broad democratic agenda, and given the
environment of demographic and geographic shift in
which the agenda was to be carried out, the estimation
of educational quality by a ". . . test of thirty
questions on the subjects scheduled for study during
the year . . . given to about half the eighth grade, one
thousand students,"26 is a telling early example of
the limitations of tests in measuring the range of
knowledge students acquire during a school year.

From their inception, written achievement tests
were among the more potent weapons of reform of
teaching and school administration. For example,
Samuel Gridley Howe, an ally of Mann, looked to
tests to provide ". . . a single standard by which to
judge and compare the output of each school,
'positive information in black and white,' [in place
of] the intuitive and often superficial written evalua-
tion of oral examinations."27

The tests Mann and Howe encouraged covered a 
narrow range of school material; there was no
attempt to link students' test performance with
specific features of school organization or peda-
gogy; and the schoolmasters usually selected which
students took the tests.28 But these technical issues
did not interfere with the use of test results as a basis 
for reform. Mann, for one, successfully convinced 



his fellow Bostonians that the tests were able to 
". . . determine, beyond appeal or gainsaying, whether
the pupils have been faithfully and competently
taught."29 Teachers, for their part, went along with
the testing as long as they saw it as a way to wield
power over their students.30

Effects of Test Use 

Not surprisingly, soon after the first application of 
tests came criticisms that have also become a steady 
presence in school life. First, there was public
amazement at the poor showing of the test-takers:
"Out of 57,873 possible answers, students answered
only 17,216 correctly and accumulated 35,947 errors
in punctuation in the process. Bloopers abounded:
one child said that rivers in North Carolina and
Tennessee run in opposite directions because of 'the
will of God.'" Second, it was feared that the tests
were driving students to learn by rote: ". . . [according
to Howe] they could give the date of the embargo but
not explain what it did."31

Nevertheless, test use continued, and from the
earliest applications, test use raised key questions.
Consider, for example, that the main beneficiaries of
test information were not the teachers and principals,
who might have used it to change aspects of their
specific institutions, but rather State-level policy-
makers and administrators. Thus, while there might
have been a casual acceptance of the principle that
tests could provide information necessary to effect
change, there was apparently much less agreement —
or perhaps just simple naivete — as to how and where
the changes would be initiated. "The most important
reported result, an unintended one from the stand-
point of the [Boston] school committee, was to make
city teachers and principals accountable to supervi-
sory authority at the State level."32 Tests became



25Schools were not alone in their growing admiration for quantification. Prison reformers, abolitionists, and others were also fond of statistics. For
a lucid discussion of the reverence for science and quantitative methods, which would peak at the turn of the century, see Paula S. Fass, "The IQ: A
Cultural and Historical Framework," American Journal of Education, vol. 88, No. 4, August 1980, pp. 431-458.

26Resnick, op. cit., footnote 3, p. 179.

27Tyack, op. cit., footnote 4, p. 35, emphasis added.

28"Even within the grade, [the Boston test] was not a fair sample of students, since the schoolmasters were free to choose who would take the test."
Resnick, op. cit., footnote 3, p. 179.

29Quoted in Paul Chapman, Schools as Sorters: Lewis M. Terman, Applied Psychology, and the Intelligence Testing Movement, 1890-1939 (New
York, NY: New York University Press, 1988), p. 33.

30Robert Hample, University of Delaware, personal communication, May 1991.

31Tyack, op. cit., footnote 4, p. 35. According to Tyack, Howe knew how "abstruse and tricky" the test items were, but thought it was a fair basis
for comparison of students nonetheless. Given the reference to punctuation errors, it seems that the tests included at least some written work; in any event,
we know that multiple choice was not invented until several decades later, which suggests that test format is not the sole determinant of content validity,
fairness, or the tendency to learn.

32Resnick, op. cit., footnote 3, p. 180, emphasis added.
important tools for education policymakers, despite
their apparently limited value to teachers, students, 
and principals. 

A related development offers yet another illustra- 
tion that current problems in educational testing are 
not all new. Although the written examinations were 
intended to provide information about schools and 
students, that information was not necessarily meant 
to become a basis for comparisons. Yet that is 
quickly what happened, as illustrated in the case of
examinations used for high school admission: "Al-
though only a minority of students took the [standard
short-answer] exam, performance [on the exam] . . .
could function, within the larger communities, to
compare the performance of classes from different
feeder schools."33

The case cited in this example points to a 
pervasive dilemma in the intended and actual uses of 
tests. On the one hand, information about student 
performance was understood to be essential as a 
basis for organizing classroom learning and judging
its output; on the other hand, once the information
was created, it was quickly appropriated to uses for
which it had not been designed — specifically, to
comparisons among schools and districts. The fact
that the jurisdictions were different in so many
fundamental ways as to render the comparisons
virtually meaningless did not seem to matter.
Nevertheless, by the 1870s many school leaders
were beginning to question the comparisons: ". . . a
careful observation of this practice for years has
convinced me that such comparisons are usually
unjust and mischievous."34 At the same time, there
was widespread agreement that ". . . the classroom
was part of the production line of the school factory
[and that] examinations were the means of judging
the value added to the raw material . . . during the
course of the year."35

In the latter part of the 19th and early 20th 
centuries, changing demography would continue to 
influence school and test policy. Other factors would
also begin to play a role: the development of
psychology and "mental measurement" as a sci-
ence, and the increasing influence of university and
business interests on performance standards for the



secondary schools. These are the main topics in the 
next section of the chapter. 

Science in the Service of 
Management: 1875 to 1918 

During the period from 1875 to the end of World
War I, the development and administration of a
range of new testing instruments — from those that
sought to measure mental ability to those that
attempted to assess how well students were prepared
for college — brought to the forefront several critical
issues related not only to testing but to the broader
goals of American education. First, as instruments
that were designed to discern differences in individ-
ual intelligence became available, the concept of
classifying and placing students by ability gained
greater acceptance, even among those who espoused
the democratic ideals of fairness and individuality.

Second, as research on mental measurement 
continued, it gave rise to new debates about the role 
of heredity in determining intellectual ability and the 
effects of education. Some theorists used the results 
of intelligence and aptitude tests to support claims of 
natural hierarchy and of racial and ethnic superior- 
ity. 

Third, mirroring the structural changes occurring 
in businesses and other American institutions, 
school systems reorganized around the prevailing 
principles of efficient management: consolidation of 
small schools and districts, classification of stu- 
dents, bureaucratization of administrative responsi- 
bilities. Within these new arrangements, tests were 
viewed as an important efficiency tool. 

Fourth, by the end of World War I, standardized 
achievement tests were available in a variety of basic 
subjects, and the possibilities for large-scale group 
testing had been demonstrated. The results of these 
tests gave reformers (including college presidents) 
ammunition in their push for improvements in
educational quality.

Fifth, the implementation of mass testing in
World War I ushered in a new era of educational 
testing as well. 



33Ibid. For an indepth study of the role of tests and other criteria in admissions decisions at Philadelphia's Central High School, see David F. Labaree,
The Making of an American High School: The Credentials Market and the Central High School of Philadelphia, 1838-1939 (New Haven, CT: Yale
University Press, 1988), especially chs. 3 and 4.

34Emerson White (an early leader in the National Education Association), quoted in Tyack, op. cit., footnote 4, p. 49.

35John Philbrick, quoted in ibid., p. 49, emphasis added.
Issues of Equity and Efficiency

The analysis in the preceding section of this
chapter raises a perplexing question about the role of
testing in American education: how could the
emerging American and democratic theory of educa-
tion be reconciled with standardized tests that
covered, at best, a small portion of what schooling
was supposed to accomplish, and, at worst, were
used in ways that violated basic democratic princi-
ples of fairness? Part of the answer in the early years
of testing lay in the role of curriculum in the public
school philosophy. Horace Mann, for example, was
". . . inclined to accept the usual list of reading,
writing, spelling, arithmetic, English grammar, and
geography, with the addition of health education,
vocal music (singing would strengthen the lungs and
thereby prevent consumption), and some Bible
reading."36 Thus, it might be argued that one reason
Mann favored the formal examinations was that they
signaled the importance of learning the major
subjects, which, in his view, was the first step toward
achieving the broader goals of morality, citizenship,
and leadership. Learning the major subjects was a
necessary — if insufficient — condition for education
writ large.37

Another factor was that because standardized tests 
were new, there was no established methodology for 
designing them or judging whether test scores
accurately reflected learning. Furthermore, school
reformers seemed relatively unconcerned that em-
phasizing the basics might compromise the broader
objectives of schooling. Generally they viewed the
basics as just that: the necessary building blocks on
which the broader objectives of education could be 
erected. 

If that explanation helps resolve the curious 
acceptability of short tests as proxies for complex 
educational goals, it does not offer any obvious clues 
to the paradox that the use of tests to track students 
had its roots in the movement to universalize and 
democratize education. Again, Mann's thinking on
the subject can shed some light. Although "Mann



was one of the first after Rousseau to argue that
education in groups is not merely a practical
necessity but a social desideratum,"38 he had an
equally powerful belief in individuality. Mann's
answer was to tailor lessons in the classroom to meet
the needs of individual children: ". . . children differ
in temperament, ability, and interest . . ." and need
to be treated accordingly.39 From here, then, it was
not a far leap to embracing methods that, because 
they were purported to measure those differences, 
could be used to classify children and get on with the 
educational mission. 

Mann was not alone. The American pursuit of 
efficiency would become the hallmark of a genera-
tion of educationists, and would create the world's
most fertile ground for the cultivation of educational 
tests. 

An Intellectual Bridge 

Some social scientists have characterized mental
measurement — a branch of psychology that blos-
somed during the late 19th and early 20th centuries
and prefigured modern psychological testing — as
". . . the most important single contribution of
psychology to the practical guidance of human
affairs."40 Psychological testing was able to flourish
because of its appeal to individuals of nearly every
ideological stripe. It was not just the hereditarians
and eugenicists who were attracted to such concepts
as "intelligence" and the "measurement" of men-
tal ability; many of the early believers in the
measurement of mental and psychophysical proc-
esses were progressives, egalitarians, and communi-
tarians committed to the betterment of all mankind.

Mann, for one, embraced phrenology — an ap- 
proach to the assessment of various cognitive 
capacities based on physical measurement of the 
size of areas of the brain — without reservation,
joining the ranks of such advocates as Ralph Waldo
Emerson, Walt Whitman, William Ellery Channing,
Charles Sumner, and Henry Ward Beecher, as well



36Cremin, op. cit., footnote 2, p. 10.

37The belief that learned persons were better, in the moral sense, has been pervasive throughout the history of American education. See, e.g., Curti,
op. cit., footnote 15. A major figure in the measurement of ability and achievement, Edward Thorndike, produced empirical results showing the high
correlation between intellectual attainment and morality. See, e.g., Tyack and Hansot, op. cit., footnote 9, p. 156.

38Cremin, op. cit., footnote 2, p. 11.

39Ibid.

40Lee Cronbach, "Five Decades of Public Controversy Over Mental Testing," American Psychologist, January 1975.
as a host of respected physicians.41 Phrenology
attributed good or base character traits to differences
in physical endowments; Mann and others saw in
this doctrine a persuasive rationale for education as
a means of cultivating every individual's admirable
propensities and checking his coarser ones. One
might say, then, that phrenology symbolized to
Mann a unique chance to mobilize support for social
intervention.42

Phrenology was a methodological bridge from 
crude comparisons based on written achievement 
examinations, to measures that were at once more 
scientifically rigorous and more sensitive to innate 
differences in ability.43 The principal intelligence
researchers whose work would ultimately be trans-
lated into the American science of mental testing —
Galton, Wundt, and Binet — had each dabbled in
phrenology before devising their methods for assess- 
ing human intelligence. 

Mental Testing 

In the late 19th century, European and American 
psychologists began independently seeking ways to 
corroborate and measure individual differences in 
mental ability. Sir Francis Galton in England and J. 
McKeen Cattell in the United States conducted a 
series of studies — mostly dealing with sense percep-
tion but some focusing on intellectual aptitude — that
may be said to mark the beginning of modern
intelligence testing.44 It was Cattell, in fact, who
coined the term "mental test" in a paper published
in 1890.

In an effort to trace the hereditary origins of
mental differences, Galton conducted the first em-
pirical studies of the heritability of mental aptitude
and developed the first mental test, although he did
not call it that.45 Although the more extreme views
of some of these early researchers have long since
been repudiated, and although some veered off into
distasteful and unsupportable conclusions about
hereditary differences (see box 4-A), their work
nevertheless stimulated interest in intelligence test-
ing that persists today.

The French psychologist and neurologist Alfred
Binet also had a very strong influence on the
development of intelligence tests in America and on
their uses in schools, although not necessarily in the
ways Binet himself would have liked. Empirically
based definitions of intelligence and accounting
explicitly for age were two of Binet's most impor-
tant contributions to the science of mental testing.
For Binet, "intelligence" was not a measurable trait
in and of itself, like height or weight; rather, it was
only meaningful when tied to specific observable
behaviors. But what behaviors to observe? Answer-
ing this question led Binet to his second major
insight: ability to perform various mental behaviors
varied with the age of the individual being observed.
His research, therefore, consisted of giving children
of different ages sets of tasks to perform; from their
performances he computed average abilities — for
those tasks — and how individual children compared
on those tasks.46 Neither the concept that intelli-
gence existed as a unitary trait, nor the concept that
individuals have it in fixed amounts from birth, is
attributable to Binet. Moreover, to Binet and co-
worker Theodore Simon, intelligence meant ". . .
judgment, otherwise called good sense, practical 



41About Mann's attraction to phrenology, historian Lawrence Cremin wrote: "It reached for naturalistic explanation of human behavior, it stimulated
much needed interest in the problem of child health; and it promised that education could build the good society by improving the character of individual
children. What a wonderful psychology for an educational reformer!" Op. cit., footnote 2, p. 11.

42Curti, op. cit., footnote 15, pp. 110-111. Michael Katz points out that ". . . to Mann and others of his time [intelligence] meant . . . a capacity that
could be developed, not an innate limit on potential . . . an important point because it shows that 'intelligence' is partly a social/cultural construction
that we shouldn't reify." Personal communication, Aug. 18, 1991.

43The history of phrenology contains some amusing ironies. Franz Gall, for example, one of the founders of the discipline, had to suffer the
embarrassment of having his own brain weigh in "at a meager 1,198 grams," considerably lighter than the brains of real geniuses like Turgenev. For
discussion see Stephen Jay Gould, The Mismeasure of Man (New York, NY: Norton, 1981), p. 92. And Francis Galton, whose own phrenologist surmised
that his ". . . intellectual capacities are not distinguished by much spontaneous activity in relation to scholastic affairs . . ." (Raymond E. Fancher, The
Intelligence Men: Makers of the IQ Controversy (New York, NY: W.W. Norton & Co., 1985), p. 24), was later credited with launching the science of
individual differences and of mental testing.

44Walter S. Monroe, Ten Years of Educational Research, 1918-1927 (Urbana, IL: University of Illinois, 1928), p. 89.

45Borrowing methods of data collection and analysis from mathematics and astronomy, he also invented a statistical procedure that his student Karl
Pearson would later turn into what is still the most powerful tool in the statistician's arsenal, the correlation coefficient.

46Had the United States' move to universal public schooling begun in the late 19th century, and not in the middle, it is likely that the first achievement
tests (described in the first section of this chapter) would have been more focused on innate ability and aptitude rather than on mastery of subjects taught
in school. As will be shown below, however, the strands of ability and achievement ultimately did converge, largely due to the work of Terman and
sense, initiative, the faculty of adapting one's self to
circumstances. . . . A person may be a moron or an
imbecile if he is lacking in judgment; but with good
judgment he can never be either."47 These charac-
teristics of the Binet-Simon tradition were altered
when the concepts of mental testing were imported
to the United States.

Several Americans revised the Binet-Simon scale 
and adapted it for use in the United States. Stanford 
Professor Lewis Terman was perhaps the most



influential and successful of the American mental 
testers. His 1912 revisions, called the Stanford 
Revision, caught on quickly and marked the begin- 
ning of large-scale individual intelligence testing in 
the United States.48 As discussed in box 4-A, the
technology of intelligence testing in the United
States — in particular the connection between test
performance and age in the formation of intelligence
scales — was directly influenced by Binet; but the
philosophy underlying the use and interpretation of 
47A. Binet and T. Simon, The Development of Intelligence in Children, translated by E.S. Kite (Baltimore, MD: Williams and Wilkins, 1916), pp.
42-43. For discussion of the Binet-Simon tradition in intelligence testing, see, e.g., Robert Sternberg, Metaphors of Mind (Cambridge, England:
Cambridge University Press, 1990).

48Monroe, op. cit., footnote 44, p. 90.
the tests was inherited from Galton and his follow-
ers. Several historians have noted the mixed lineage
of American testing; one has summarized it elo-
quently, noting that:

. . . it was only as the French concern with personal-
ity and abnormality and the English preoccupation
with individual and group differences, as measured
in aggregates and norms, were superimposed on the
older German emphasis on laboratory testing of
specific functions that mental testing as an American
science was born.49

Testing in Context 

There is a tendency in the psychological literature 
to overstate the influence of Galton, Binet, and the 
other pioneers of mental testing on the demand for 
educational tests among American school authori-
ties. That demand grew from a range of social and 
economic forces that produced similar calls for 
efficiency and compartmentalization in the work- 
place. Interest in the application of tests undoubtedly 
would have arisen even without the hereditarian 
influences of Galton and others who thought human- 
kind could be bettered through gradual elimination 
of the subnormally intelligent.50

What was happening in the schools in the midst of 
these intellectual storms? For one thing, immigra- 
tion was becoming an even more dominant influence 
on American political and social thinking. By 1890, 
some 15 percent of the American population was 
foreign born, and the quest for Americanization was
continuing full steam. These "new" immigrants
came from Southern and Eastern Europe (Austria,
Hungary, Bulgaria, Italy, Poland, and Russia among
others), and their numbers were beginning to over-
take the traditional immigrants arriving from North-
ern Europe (Anglo-Saxons, French, Swiss, and
Scandinavians). The effects on schools were stag- 
gering. 

These abrupt demographic shifts affected many 
aspects of American life, but schools had a unique 
charge to maintain order in a society undergoing 
massive change and fragmentation and to inculcate 
American democratic values into massive numbers 
of immigrants. "Just as mass immigration was a
symbol for — even the embodiment of — cultural
disruption, education became its dialectical oppo-
site, an instrument of order, or direction, of social
consolidation."51 Because American schools were
committed to principles of democratic education and
universal access, instruments designed to bring
order to schools without violating principles of
fairness and equal access were extremely attractive.

[Photo: Schools in America have played a central role in preparing
immigrants for life in their new home. Challenged by the
goals of educating massive numbers of newcomers
fairly and efficiently, schools relied heavily
on standardized testing.]

Indeed, standardized tests offered even more than 
that. For one thing, they held promise as a tool for 
assessing the current condition of education, a 
means to gather the data from which reforms for 
integrating the masses could be designed. In what 
was perhaps the first effort to blend objective 
evaluation with journalistic-style muckraking, Jo-
seph Mayer Rice conceived the idea of giving a
uniform spelling test (and later, arithmetic and
language tests) to large numbers of pupils in selected 



49Fass, op. cit., footnote 25, p. 433. See also Cremin, op. cit., footnote 2, p. 100.

50See, e.g., Gould, op. cit., footnote 43, for a fuller discussion of the role of testing in the eugenics movement and how it influenced public policy
in the 1920s and 1930s.

51Fass, op. cit., footnote 25, p. 432.
cities. His findings, published in 1892, were based
on data he had collected on some 30,000 children,
and documented the absence of a relationship
between the time schools spent on spelling drills and
children's performance on objective tests of spell-
ing.52 "In one study, [Rice] . . . found that [instruc-
tional time] varied from 15 to 30 minutes per day at
different grade levels . . . [but that] tests of student
performance on a common list of words revealed
that the extra 15 minutes a day made no difference
in demonstrated spelling ability."53 When Rice's
results were presented to a major meeting of school
superintendents in 1897, they were ridiculed; ulti-
mately, however, a few farsighted educators con-
curred with Rice's analysis.54

Managerial Efficiency 

Schools were not alone in their attempts to adapt 
to changing times. The following description of 
change in the railroad industry could just as well
describe emerging trends in school administration: 

. . . it meant the employment of a set of managers to
supervise . . . functional activities over an extensive
geographical area; and the appointment of an admin-
istrative command of middle and top executives to
monitor, evaluate, and coordinate the work of
managers responsible for the day-to-day operations.
It meant, too, the formulation of brand new types of
internal administrative procedures and accounting
and statistical controls.55

In other sectors of American enterprise, engi-
neers, researchers, and managers were applying
scientific principles to enhance efficiency. In agri-
culture, for example, research and technology were
transforming the nature and scale of farming.
Progressive educators, who were familiar with the
commercial precedents, ". . . commonly used the
increased productivity of scientific farming as an
analogy for the scientifically designed educational
system they hoped to build."56



The newly evolving business organizations also 
employed modes of classification and bureauCTatic 
control that bore remailcable similarity to tiiose 
adopted by school systems as tiiey shifted from 
largely rural, decentralized organizations to urban, 
centralized ones. ^^Scientific management,*' a rela* 
tively late addition to the set of new business 
organizational principles invented around the turn of 
tlie century, was based on the proposition that man- 
agers could ascertain the abilities of their workers 
and assign tiiem accordingly to die jobs where they 
would be the most productive. 

Managerial efficiency was but one way in which 
business thinking coincided with school policy. The 
other principal point of convergence had to do with 
the demand for "skilled" labor. Just as division of
labor according to ability was seen as a vehicle to
improve productivity on the shop floor, classifica-
tion and ranking of students was seen as a prerequi-
site to their efficient instruction. The relationship is
perhaps best illustrated by the statements of Harvard
President Charles Eliot, in 1908. Society, he said, is:

. . . divided . . . into layers . . . [with] distinct charac-
teristics and distinct educational needs . . . a thin
upper [layer] which consists of the managing,
leading, guiding class . . . next, the skilled workers
. . . third, the commercial class . . . and finally the
thick fundamental layer engaged in household work,
agriculture, mining, quarrying, and forest work. . . .
[The schools could be] . . . reorganized to serve each
class . . . to give each layer its own appropriate form
of schooling.57

It was an obvious leap, then, for business execu- 
tives to join with progressives in calling for reform 
of schools along the corporate model. Hierarchy,
bureaucracy, and classification — all served by the
science of testing — would become the institutional
environment charged with producing educated per-
sons capable of functioning in the hierarchical,
bureaucratic, and classified world of business.58



⁵²Haney, op. cit., footnote 3, p. 600.
⁵³Resnick, op. cit., footnote 3, p. 180.
⁵⁴Monroe, op. cit., footnote 44, pp. 88-89.

⁵⁵Alfred Chandler, The Visible Hand: The Managerial Revolution in American Business (Cambridge, MA: Harvard University Press, 1977), p. 87.
Chandler's description of changes in railroad management suggests another analogy with school administration. Daily reports, from conductors, agents,
and engineers, detailed every aspect of railroad operations; these reports, along with information from managers and department heads, were used to
make day-to-day decisions and, at the executive level, to compare the performance of operating units with each other and with other railroads (p. 103).

⁵⁶Tyack and Hansot, op. cit., footnote 9, p. 157.

⁵⁷Tyack, op. cit., footnote 4, p. 129.

⁵⁸For a critical analysis of testing and social/economic stratification in the United States, see, e.g., Clarence Karier, "Testing for Order and Control
in the Corporate Liberal State," in Block and Dworkin (eds.), op. cit., footnote 24, pp. 339-373.
The advocates of the corporate model of school
governance, such as Stanford Education Dean Ellwood
P. Cubberley, argued that to manage efficiently,
the modern school superintendent needed
"rich and accurate flows of information" on enrollments,
buildings, costs, student promotions, and
student achievement.⁵⁹ Cubberley advocated the
creation of "scientific standards of measurement
and units of accomplishment" that could be applied
across systems and used to make comparisons.
Fulfilling this need for data, Cubberley maintained,
would require new types of school employees,
efficiency experts ". . . to study methods of procedure
and to measure and test the output of its
works";⁶⁰ a recommendation that indeed came to
pass as large, urban systems hired census takers,
business managers, and eventually evaluation experts
and psychologists.

Achievement and Ability Vie 
for Acceptability 

Despite initial opposition from teachers, the use of
achievement tests as instruments of accountability
began to gain support. By 1914 the National
Education Association was endorsing the kind of
standardized testing that Rice had been urging for
two decades. The timing was exquisite: on one front,
there was the "push" of new technology that
promised to be valuable to testing, and on the other,
a heightened "pull" for methods to bring order to
the chaotic schools.

Two approaches to testing competed for dominance
in the schools in the early 20th century. One
had its antecedents in the intelligence testing movement,
the other in the more curriculum-oriented
achievement testing that grew out of Rice's examples.

Between 1908 and 1916, Edward Thorndike and
his students at Columbia University developed
standardized achievement tests in arithmetic, handwriting,
spelling, drawing, reading, and language
ability. Composed of exercises to be done by
students, the arithmetic test was similar in format to
the types of tests traditionally administered by
teachers. The handwriting and composition tests, by
contrast, consisted of samples of handwriting and
essays against which pupil performances were
compared.⁶¹ By 1918, there were well over 100
standardized tests, developed by different researchers
to measure achievement in the principal elementary
and secondary school subjects.⁶²

Student achievement was not all that would come
under the microscope of standardized assessment. In
the first decade of the 20th century, following the
advice of Cubberley and other advocates of scientific
management, "leaders of the school survey
movement examined and quantified virtually every
aspect of education, from teaching and salaries to the
quality of school buildings."⁶³ Indeed, Thorndike's
proclamation of 1918, "whatever exists at all
exists in some amount," formed the cornerstone of
his educational measurement edifice.⁶⁴ By 1922,
John Dewey would lament the victory of the testers
and quantifiers with these words: "Our mechanical,
industrialized civilization is concerned with averages,
with percents. The mental habit which reflects
this social scene subordinates education and social
arrangements based on averaged gross inferiorities
and superiorities."⁶⁵

Thorndike's approach to achievement tests mirrored
in important ways that taken by reformers in
Massachusetts some 70 years earlier: just as they had
reached a foregone conclusion about the quality of
Boston schools before the first tests were given,
Thorndike's tests actually came after he had already
decided that the schools were failing. His 1908 study
of dropouts, followed the next year by a remarkable
statistical analysis conducted by Leonard Ayres,



⁵⁹Tyack and Hansot, op. cit., footnote 9, p. 157.

⁶⁰Ellwood P. Cubberley, Public School Administration (Cambridge, MA: The Riverside Press, 1916), p. 338.
⁶¹Monroe, op. cit., footnote 44, p. 90.

⁶²Cremin, op. cit., footnote 2, p. 187. A report by Walter Monroe in 1917 documented over 200 such tests. See Chapman, op. cit., footnote 29, p. 34.
⁶³Chapman, op. cit., footnote 29, pp. 34-35.

⁶⁴In later writings, Thorndike was more humble. For example, he wrote: "Existing instruments (for measuring intellect) represent enormous
improvements over what was available twenty years ago, but three fundamental defects remain. Just what they measure is not known; how far it is proper
to add, subtract, multiply, divide, and compute ratios with the measures obtained is not known; just what the measures obtained signify concerning
intellect is not known. . . ." Edward L. Thorndike, E.O. Bregman, M.V. Cobb, and Ella Woodyard, The Measurement of Intelligence (New York, NY:
Columbia University, Teachers College, Bureau of Publications, 1927).
⁶⁵Tyack, op. cit., footnote 4, p. 198.
[Figure: reproduction of a standardized silent reading test form]

Grades 6, 7 and 8.

City ........ State ........ Date ........
Pupil's Name ........ Age ........ Grade ........
School ........ Teacher ........

Directions for Giving the Tests.

After telling the children not to open the papers, ask those on the
front seats to distribute the papers, placing one upon the desk of each
pupil in the class. Have each child fill in the blank spaces at the top
of this page. Then make clear the following:

Instructions to be Read by Teacher and Pupils Together.

This little five-minute game is given to see how quickly and accurately
pupils can read silently. To show what sort of game it is, let
us read this:

Below are given the names of four animals. Draw a
line around the name of each animal that is useful on the
farm:

cow    tiger    rat    wolf

This exercise tells us to draw a line around the word cow. No
other answer is right. Even if a line is drawn under the word cow,
the exercise is wrong, and counts nothing. The game consists of a lot
of just such exercises, so it is wise to study each exercise carefully
called attention to an alarming problem.⁶⁶ For
reasons that neither Thorndike nor Ayres professed
to understand entirely, the schools were full of
students who were not progressing. In New York
City, for example, Ayres reported that 23 percent of
the 20,000 children studied were above the normal
age for their grade.

Where could concerned educators of the time turn
for explanations? It is useful to review in this context
the staggering demographic changes of the time, a
phenomenon that so utterly consumed the collective
psyche that Thorndike, Ayres, or anyone else
thinking about the schools could not have helped but
try to explain their findings in terms of the changing
national origin of students. Between 1890 and 1917,
the total U.S. population grew from 63 million to
over 100 million, largely as a result of immigration.
During the same period, the population aged 5 to 14
grew from just under 17 million to over 21 million;
similarly, the public school enrollment rate climbed
from about 50 percent in 1900 to 64 percent in 1920,
and average daily attendance went from 8 million to
just under 15 million.⁶⁷

The effects of immigration and population growth
on the issues Thorndike and Ayres grappled with,
however, were somewhat surprising. While Ayres's
initial research question, "Is the immigrant a
blessing or a curse?"⁶⁸ reveals something about
the anti-immigrant Zeitgeist, his answers, based on
the data analysis he presented, revealed a healthy
objectivity. Ayres concluded that:

1. there was no evidence that the problems of 
students being above normal age for their 
grade or dropping out were most serious in 
those cities having the largest foreign popula- 
tions; 

2. ". . . children of foreign parentage drop out of
the highest grades and the high school faster
than do American children;

3. . . . there are more illiterates among the native
whites of native parentage than among the
native whites of foreign parentage;"⁶⁹ and

4. ". . . the proportion of children five to fourteen
years of age attending school is greater among
⁶⁶See Leonard Ayres, Laggards in Our Schools: A Study of Retardation and Elimination in City School Systems (New York, NY: Russell Sage
Foundation, Charities Publication Committee, 1909), p. 8.

⁶⁷For analysis of the effects of child labor laws on school attendance, see David Goldston, History Department, University of Pennsylvania, "To
Discipline and Teach: Compulsory Education Enforcement in New York City, 1874-94," unpublished monograph, n.d.

⁶⁸Ayres, op. cit., footnote 66, p. 103.

⁶⁹Ibid., p. 115. Ayres did not cite the source for his illiteracy statistics, which he presumably collected himself. Census data suggest a somewhat
different picture from the one presented by Ayres. In 1900, for example, about 5 percent of the native white population was estimated to be illiterate,
as compared to almost 13 percent of the foreign born. Had he included the census category "Negro" (and other races), he might have found, as
did the census, a staggering illiteracy rate of 44 percent in 1900. See Bureau of the Census, op. cit., footnote 5, Series H 664-668, p. 382.
those of foreign parentage and foreign birth
than among Americans."⁷⁰

Finally, he concluded from his analysis that
". . . in the country at large [the schools] reach the
child of the foreigner more generally than they do the
child of the native born American," which was a
source of great humiliation to "national pride."⁷¹

Experimentation and Practice 

Although Ayres may not have been aware of it, his 
work actually vindicated the basic tenets of the 
achievement-oriented testers, who tended to focus 
on school curricula and the extent to which children 
were actually mastering the substantive content of 
schooling. Their approach to assessment was to 
develop quantitative and qualitative measures of 
student "productions," and the ". . . early versions
of standardized tests were developed by public
school systems, often in collaboration with university
centers, to reflect the curriculum of the schools
in a particular city."⁷²

This approach to assessment recognized implicitly
that institutional factors were largely responsible
for the sorry situation in the schools. Moreover, if 
school practices changed, then children's opportuni- 
ties for success would improve, and it was believed 
that the kind of information provided by the stand- 
ardized achievement tests could light the way to 
effective reform. 

Much to the frustration of the dedicated educators 
who had mounted them, the effects of school reform 
efforts were typically disappointing. In New York, 
for example, in 1922, nearly one-half of all students 
were above the normal age for their school grade, 
and there was enormous variability in ages of pupils 
in any given grade.⁷³

This sort of experience did not dissuade educators
from the idea of using tests to effect change, but
rather persuaded many of them that poor student
achievement stemmed from low innate ability. In
other words, even the achievement tests of Thorndike
were inadequate to measure, and remedy, the
problems of schools, because those tests did not
adequately measure basic intelligence. The statements
of New York Superintendent William Ettinger
underscore the intrinsic appeal of the intelligence
test model:

. . . rapid advance in the technique of measuring 
mental ability and accomplishments means that we
stand on the threshold of a new era in which we will
increasingly group our pupils on the basis of both
intelligence and accomplishment quotients and of
necessity, provide differentiated curricula, varied
modes of instruction, and flexible promotion to meet
the crying needs of our children.⁷⁴

Thus, for Ettinger and others, the achievement tests 
available at the time were still not standardized 
enough — they did not get at the root causes of 
difference in student performance. 

New York was not alone. Oakland, California, 
was the site of one of the first attempts at large-scale 
intelligence testing of students. During the 1917 and 
1918 academic years, 6,500 children were given the 
Stanford-Binet, as well as a new test written by 
Arthur Otis (one of Lewis Terman's students who
would eventually be credited with the invention of
the multiple-choice format⁷⁵). The experiment in
Oakland was significant because it was one of the
first attempts to use intelligence tests to classify
students: "Intelligence tests were used at first to
diagnose students for special classes; later their
adoption led to the creation of a systemwide tracking
plan based on ability. . . . The experiment with
testing in Oakland . . . would provide a blueprint for
the intelligence testing movement after the war."⁷⁶

The Influence of Colleges 

Another institutional force exerted pressure on the 
schools during this period. The university sector sent 
a clear message of dissatisfaction with the quality of 
high school graduates, and urged a return to the high
standards to which the elite colleges had been 
accustomed in earlier times. Many academic leaders 



⁷⁰Ayres, op. cit., footnote 66, p. 115.
⁷¹Ibid., p. 105.

⁷²Edward Haertel and Robert Calfee, "School Achievement: Thinking About What to Test," Journal of Educational Measurement, vol. 20, No. 2,
summer 1983, p. 120.

⁷³Tyack, op. cit., footnote 4, p. 203.

⁷⁵See ch. 8.

⁷⁶Chapman, op. cit., footnote 29, p. 56.
were attracted to the intelligence test as a filter in
their admissions process. The President of Colgate, 
along with leaders of the Carnegie Foundation, the 
University of Michigan, Princeton, Lehigh, and 
other higher education institutions, argued that too 
many children were in college who did not belong 
there. 

As early as 1890, Harvard President Charles
William Eliot proposed a cooperative system of
common entrance examinations that would be acceptable
to colleges and professional schools throughout
the country, in lieu of the separate examinations
given by each school. The interest of Eliot and
like-minded college presidents in a standardized set
of national examinations went beyond their immediate
admissions needs. Their broader objective was to
institute a consistent standard that could be used to
gauge not only the quality of high school students'
preparation, but also, by inference, the quality of the
high schools from which those students came. The
ultimate aim was to prod public secondary schools
to standardize and raise the level of their instruction,
so that students would be better prepared for higher
education. Eliot expressed consternation that ". . . in
the present condition of secondary education one-half
of the most capable children in the country, at
a modest estimate, have no open road to colleges and
universities."⁷⁷

Getting colleges and universities to agree on the 
subjects to be included and the content knowledge to 
be assessed in a common college entrance examina- 
tion was no easy task. Anticipating the minimum 
competency testing movement by almost a century,
the opponents of a standard college entrance exami- 
nation voiced early concerns about whether these 
tests could lead to State examinations that would 
eventually be used for awarding degrees as well as 
college admission. 

Eventually the advocates of common examinations
were able to garner enough support to form the
College Entrance Examination Board in 1900. In
1901, the first examinations were administered
around the country in nine subjects. While in later
years college admissions examinations would come
to resemble tests of general intelligence, the early
examinations of the College Board were closely tied
to specific curricular requirements: ". . . the hallmark
[of the examinations] was their relation to a
carefully prescribed area of content."⁷⁸

Within a relatively short period of time, the 
College Board became a major force on secondary 
school curricula. The Board adopted the practice of 
formulating and publicizing, at least a year before a 
new examination was introduced, a statement de- 
scribing the preparation expected of candidates. 
Developed in consultation with scholarly associations,
these statements, in the opinion of one
observer, ". . . became a paramount factor in the
evolution of secondary school curriculum, with a
salutary influence on both subject matter and teaching
methods."⁷⁹ This glowing assessment was not
shared by all educators. By the end of World War I, 
many school superintendents shared the concerns of 
one California teacher who wrote the following to 
the Board in 1922: 

These examinations now actually dominate, control,
and color the entire policy and practice of the
classroom; they prescribe and define subject and
treatment; they dictate selection and emphasis.
Further, they have come, rightly or wrongly, to be at
once the despot and headsman professionally of the
teacher. Slight chance for continued professional
service has that teacher who fails to "get results" in
the "College Boards," valuable and inspiring as his
instruction may otherwise be.⁸⁰

World War I 

Army testing during World War I ignited the most
rapid expansion of the school testing movement. In
1917, Terman and a group of colleagues were
recruited by the American Psychological Association
to help the Army develop group intelligence
tests and a group intelligence scale. This later
became the Alpha scale, used by the Army to quickly
and efficiently determine which recruits were capable
of service and to assign them to jobs.⁸¹



⁷⁷Much of the discussion of the early history of the College Board comes from John A. Valentine, The College Board and the School Curriculum:
A History of the College Board's Influence on the Substance and Standards of American Education, 1900-1980 (New York, NY: College Entrance
Examination Board, 1987). Eliot is quoted on p. 3.

⁷⁸From the autobiography of James B. Conant, quoted in ibid., p. 21.

⁷⁹Claude M. Fuess, quoted in ibid., p. 19.

⁸⁰Ibid., p. 29.

⁸¹Monroe, op. cit., footnote 44, p. 95.
The administration of group intelligence tests
during the war stands out to this day as one of the
largest social experiments in American history. Prior
to World War I, most intelligence tests had been
administered to individuals, not large groups. In a
period of less than a month, the Army's psychologists
developed and field tested an intelligence test.
Almost as quickly, the Army began applying the
tests to what today would clearly be called "high-stakes
decisions." The Alpha tests, for the normal
population, and the Beta tests, for the subnormal,
both loosely structured after Binet's tests for children,
were given to just under 2 million young Army
men, and the results were used as the basis for job
assignments. "In short, the tests had consequences:
in part on the basis of a short group examination
created by a few psychologists in about a month,
testee number 964,221 might go to the trenches in
France while number 1,072,538 might go to offices
in Washington."⁸²

The results from this testing were mixed. For one 
thing, validation studies were less than conclusive
and Army personnel (and others) criticized the 
validity of the tests. In one such study (the typical 
validation study used officers' ratings of soldiers' 
proficiencies as the outcome or criterion measure), 
correlations between performance on the Alpha test 
and officers' ratings were in the low 0.60s, and on 
the Beta test in the 0.50s.⁸³ The Army itself had
mixed feelings about the testing program, and 
eventually it discontinued testing its peacetime 
force. 

One of the most important outputs of the program
was the mass of data that could be mined by eager
intelligence theorists. Some theorists reached particularly
controversial and inflammatory conclusions,
most notably that 1) a substantial proportion of
American soldiers were "morons," which was
presented as evidence that the American "stock"
was deteriorating; and 2) in terms of test performance,
the ranking of intelligence was white Americans
first, followed by Northern Europeans in
second place, with immigrants from Southern and
Eastern Europe a distant third. These findings helped
fuel the work of a small but vocal group of
eugenicists, such as Carl Brigham, who advocated
". . . selective breeding [to create] a world in which
all men will equal the top ten percent of present
men. . . ."⁸⁴ This reasoning contributed to congressional
debate over restrictive immigration legislation.⁸⁵

Testing Through World War II: 
1918 to 1945 

Overview 

Several themes emerged during the period of 1918
to 1945 that continue to be relevant to testing policy.
A basic lesson of the period was that in a society
constantly struggling with tradeoffs between equity
and efficiency, an institution that claims to serve
both objectives at once commands attention. If
achievement and intelligence tests had been viewed
purely in terms of more efficient classification, they
would have undoubtedly encountered even more
public opposition than they did. But because the
tests were promoted as tools to aid in the efficient
allocation of resources according to principles of


⁸²Tyack, op. cit., footnote 4, p. 204.

⁸³A 0.5 validity coefficient does not mean that predictions of soldiers' future performance based on their test scores were right about one-half the time.
Rather, it suggests a linear and nonrandom relationship (0 correlation would signify complete randomness) between the score and the criterion variable.
It should be noted that today's tests used for selection and placement (e.g., the Scholastic Aptitude Test for college admissions or the General Aptitude
Test Battery for employment) have predictive validities (con^^

Fairness in Employment Testing (Washington, DC: National Academy Press, 1989). For a critique of the policy to use employment tests with low
predictive validity, see, e.g., Henry Levin, "Issues of Agreement and Contention in Employment Testing," Journal of Vocational Behavior, vol. 33,
No. 3, December 1988, pp. 398-403.

Since the days of the Army Alpha, the psychometric quality of tests used in screen^
evidence that the criterion measures for the Army Alpha were psychometrically sound, or that other test features would pass today's scientific muster.
Stephen Jay Gould made this point quite forcefully in his book The Mismeasure of Man (op. cit., footnote 43): his experiment that demonstrated how
Harvard students, hardly an illiterate lot, performed on the Beta version of the test, designed for recruits who could not read, is often cited as prima
facie evidence of the low psychometric quality of the Army intelligence tests.

⁸⁴Karier, op. cit., footnote 58, p. 347. Some of the early faith in eugenics was fueled by the writing of H.H. Goddard, as described in Gould (op. cit.,
footnote 43), Fancher (op. cit., footnote 43), and other histories. However, it is important to note that Goddard later recanted his findings concerning
the allegedly low intelligence levels of immigrants and Black Americans, and publicly apologized for the effects those findings might have had. For
discussion, see Carl Degler, In Search of Human Nature (Cambridge, England: Oxford University Press, 1991).

⁸⁵See, e.g., Gould, op. cit., footnote 43.
"meritocracy," they appealed to a wide spectrum of
the American polity.⁸⁶

Second, the development of mental measurement,
part of the broader emergence of psychology as a
bona fide science, coincided with profound demographic
and geographic shifts in American society.
New educational testing models were cultivated in
this crossroads of technological push (psychology)
and social pull (the need to reform schools and
schooling). Windows of opportunity of this sort are
rare in history; how society capitalizes on them can
have deep and long lasting impacts.

Third, it is important to distinguish technology of
testing from ideology of test use. The history of
testing in America suggests that political, social, and
economic uses for testing can substantially exceed
the technical limits imposed by test design.⁸⁷

Fourth, there appears to be a trend from highly 
specific and curriculum-oriented achievement tests 
toward tests of increasingly general cognitive ability.
This trend has historically been associated with
attempts to extend principles of accountability to 
larger and larger jurisdictions, i.e., from schools to 
districts to States and ultimately to the Nation as a 
whole. As shown by the developments in college 
admissions testing, for example, the move toward 
consolidation of admissions criteria and the per- 
ceived need to influence secondary school education 
nationwide led eventually to the adoption of a test 
designed explicitly to assess aptitude, which later 
was renamed "developed ability," rather than
achievement of specific curricular goals. This trend 
has been reinforced, historically, by several other 
factors: 

• the incentives for efficiency, made particularly
important by the commitment to assess massive
numbers of students over many different learning
objectives;



• the recurring interest in using tests as a way to 
mitigate the cultural differences in a heteroge- 
neous population; and 

• the tendency to shift blame for the quality of 
education, i.e., to explain low achievement in 
terms of low innate ability of students rather 
than in terms of poor management and instruc- 
tion. 

Fifth, growth in the use of standardized tests often 
coincides with heightened demand for greater unifi- 
cation in curricula. Although the history does not 
demonstrate a fixed direction of causality, it does 
suggest the following sequence: initially there is 
growing recognition that many schools are not doing 
as well as they should; next there is awareness of a
fragmented school system which, if nothing else,
makes it difficult to obtain systematic information
about what is really happening in classrooms; and
finally there is a simultaneous push for standardization
in measurement (to facilitate reliable comparisons)
and standardization of instruction (to remedy
the fragmentation).

A Legacy of the Great War 

Despite the questionable foundations and effects 
of the Army's intelligence testing experiments, the 
terrain had been plowed, and on the conclusion of 
World War I, schools were only too willing to 
partake of the harvest. At long last, it seemed to 
many school leaders, there was a technology that 
could be deployed in the service of elevating the 
quality of education provided to the Nation's youth. 
"Better testing would allow [the schools] to perform
their sifting scientifically,"⁸⁸ i.e., to classify children
according to their innate abilities and in so
doing, protect the slow witted from the embarrassments
of failure while allowing the gifted to rise to
their rightful levels of achievement.

World War I, in effect, set in motion the process 
that would result — in an incredibly short time— in 



⁸⁶. . . Michael Young, The Rise of the Meritocracy,

1870-2033: An Essay on Education and Equality (London, England: Thames and Hudson, 1958). Paula Fass notes that: "The IQ established a
meritocratic standard which seemed to sever ability from the confusions of a changing time and an increasingly diverse population, provided a means
for the individual to continue to earn his place in so^

the mass while locating social talent." Fass, op. cit., footnote 25, p. 4^^

⁸⁷Historian Michael Katz disagrees:

I can't agree with . . . the point . . . that there's a difference between the purpose of testing (or the technology or science of testing) and the uses
to which testing is put. . . . The argument creates a false dichotomy which seems to reflect a naive view of scientific and technological
development as self-contained and unaffected by their context. Clearly, this wasn't so; psychology and testing as research enterprises were
products of time and place with all that implies.
Katz, op. cit., footnote 42.

⁸⁸Tyack, op. cit., footnote 4, p. 206.
national intelligence testing for American school
children. By the end of the first decade after the war,
standardized educational testing was becoming a
fixture in the schools. A key development of the
period was the publication of test batteries, which
". . . relieve[d] the teacher or other user from the
task of selecting the particular tests to be used . . .
[and which provided] a method for combining the
several achievement scores into a single measure."
Many testmakers included detailed instructions and
scoring procedures for using achievement and intelligence
tests in conjunction with each other, in order
to gauge ". . . how well a school pupil is capitalizing
his mental ability."

The proponents of testing were extraordinarily 
successful: ". . . one of the truly remarkable aspects
of the early history of IQ testing was the rapidity of
its adoption in American schools nationwide."
Another aspect was that researchers obtained their 
data not from a controlled laboratory or limited trial 
programs, but from real schools in which millions of 
students were taking the tests. This period of testing, 
then, involved a complicated two-way interaction 
between the research community and the public, 
with the mass testing of children (and the use of test
results to support important administrative decisions)
occurring even as research on the validity and
usefulness of tests continued to develop. 

It is not surprising that testing engendered public 
controversy, given that its most visible manifesta- 
tion in those days was in selection. Had the tests 
been used to diagnose learning disorders among 
children and to create appropriate interventions, they 
would have likely enjoyed more public support. But 
the tests were mostly used as they had been during
the war, namely to classify (i.e., label and rank) 
individuals, and to assign them to positions accordingly.
A U.S. Bureau of Education Survey conducted
in 1925 showed that intelligence and achievement
tests were increasingly used to classify students.
Group-administered intelligence tests were
most likely to be used for classification of pupils into 
homogeneous groups, and educational achievement 
tests were most likely to be used to supplement 
teachers' estimates of pupils' ability. Related survey 
data showed that 90 percent of elementary schools 
and 65 percent of high schools in large cities 
grouped students by ability, and that the use of 
intelligence tests as the basis for classification was 
widespread. 

By the fall of 1920 the World Book had published 
nearly half a million tests, and by 1930 Terman's 
intelligence and achievement tests (the latter published 
as the Stanford Achievement Test) had 
combined sales of some 2 million copies per year. If 
test production and sales are any indicator of social 
preferences, the data suggest a marked preference 
for achievement measures over tests of innate 
intelligence. Between 1900 and 1932, there were 
some 1,300 achievement tests on the market, as 
compared to about 400 tests of "mental capacities."^ 
High school tests, vocational tests, assessments 
of athletic ability, and a variety of miscellaneous 
tests had been developed to supplement the 
intelligence tests, and statewide testing programs 
were becoming more common.^^ 

The Iowa Program 

In 1929, the University of Iowa initiated the first 
major statewide testing program for high school 
students. Directed by E.F. Lindquist, the Iowa 
program had several remarkable features: every 
school in the State could participate on a voluntary 
basis; every pupil in participating schools was tested 
in key subjects; new editions of the achievement 
tests were published annually; and procedures for 
administering and scoring tests were highly struc- 
tured. Results were used to evaluate both students 
and schools, and schools with the highest composite 
achievement received awards. In addition, Lindquist 
was among the first to extend the range of student 
abilities tested. The Iowa Tests of Basic Skills and 
the Iowa Test of Educational Development became 
tools for diagnosis and guidance in grades three to 
eight and in high school, respectively. The Iowa 
program was also a significant demonstration of the 
feasibility of wide-scale testing at a reasonable cost. 



Monroe, op. cit., footnote 44, p. 99. 
Fass, op. cit., footnote 25, p. 445. 

W.S. Deffenbaugh, Bureau of Education, U.S. Department of the Interior, "Uses of Intelligence Tests in 215 Cities," City School Leaflet No. 20, 
1925. 

Chapman, op. cit., footnote 29 (citing data from Hildreth), p. 149. 
Monroe, op. cit., footnote 44, pp. 96, 106, and 111. 

E.F. Lindquist (1901-1978), at left, one of the fathers of 
standardized achievement testing, directed the Iowa 
testing programs. In 1952, E.F. Lindquist developed the 
basic circuitry design for the first electronic scoring 
machine, as shown below. 

By the late 1930s, Iowa tests were being made 
available to schools outside the State.^ 

Under Lindquist, the Iowa program had a remarkable 
influence on swinging the pendulum of educational 
testing back in the direction of diagnosis and 
monitoring, and away from classification and selection. 
Indeed, the distinction between intelligence 
and standardized achievement tests, in their design 
and content as well as their scores, was always fuzzy. 
In any event, the use of intelligence tests encoun- 
tered substantially heavier criticism than the use of 
achievement tests, if not on the grounds of their 
relative design strengths and weaknesses, then on 
the extent to which they became the basis for 
classifying and labeling children early in their lives. 

Multiple Choice: Dawn of an Era 

The achievement tests that gained popularity 
during the 1920s looked very different from the 
pre-World War I educational tests. Achievement 
tests were designed largely with the purpose of 
sorting and ranking students on various scales. This 
model of test design has dominated achievement 
testing ever since. 

One of the most significant developments was the 
invention of the multiple-choice question and its 
variants. The Army tests marked the first significant 
use of the multiple-choice format, which was 
developed by Arthur S. Otis, a member of the Army 
testing team who later became test editor for World 
Book. In the view of the Army test developers, the 
multiple-choice format provided: 

... a way to transform the testees' answers from 
highly variable, often idiosyncratic, and always 
time-consuming oral or written responses into easily 
marked choices among fixed alternatives, quickly 
scorable by clerical workers with the aid of superim- 
posed stencils.^^ 

The multiple-choice item and its variant, the true-false 
question, were quickly adapted to student tests 
and disseminated for classroom use, marking another 
revolution in testing. Lindquist and coworkers 
at the Iowa program later invented mechanical and, 
subsequently, electromechanical scoring machines that would 
make possible the streamlined achievement testing 
of millions of students.^ 

Not surprisingly, the rapid spread of multiple- 
choice tests kindled debate about their drawbacks. 
Critics accused them of encouraging memorization 
and guessing, of representing "reactionary ideals" 
of instruction, but to no avail. Efficiency and 
"objectivity" won out; by 1930 multiple-choice 
tests were firmly entrenched in the schools. 

Critical Questions 

In the late 19th and early 20th centuries, the 
potential for science to liberate the schools from 
their shackles of inefficiency was almost universally 
accepted. As suggested earlier, this fact helps 
explain the apparently ironic marriage of testing and 
progressivism. 

But if the spirit of progressivism catapulted 
scientific-style testing, it was that same progressiv- 
ism that ultimately reined it in. In a nutshell, the 
intelligence testers went too far. When Brigham 
used the Army data to argue that Blacks were 
naturally inferior; when Robert Yerkes wrote that 
one-half of the white recruits were morons; when H. 
H. Goddard suggested that the intellectually slov- 
enly masses were about to take over the affairs of 
state; or when a popular writer named Albert 
Wiggam "... declared that efforts to improve stand- 
ards of living and education are folly because they 
allow weak elements in the genetic pool to survive, 
[and] that 'men are born equal' is a great 'sentimental 
nebulosity' . . .";^7 it became clear to progressives 
like John Dewey that testing had run amok. 

Thus, in the days immediately following the first 
World War, the "heyday of intelligence testing" 
was confronted by a kind of field day of antitesting 
muckraking. And the muckrakers were progressives: 
most notably, Walter Lippmann, whose 10 articles in 
the New Republic attempted to remind readers that 
". . . the Army Alpha had been designed as an 



Julia J. Peterson, The Iowa Testing Programs (Iowa City, IA: University of Iowa Press, 1983), pp. 1-6. 

Franz Samelson, "Was Early Mental Testing (a) Racist Inspired, (b) Objective Science, (c) A Technology for Democracy, (d) The Origin of Multiple 
Choice Exams, (e) None of the Above? Mark the RIGHT Answer," Psychological Testing and American Society: 1890-1930, Michael M. Sokal (ed.) 

instrument to aid classification, not to measure 
intelligence."^ It was almost as though Lippmann, 
an early supporter of tests to aid in the efficient 
management of schools, suddenly recognized that 
the very same tests could be put to different ends. 
"Intelligence testing," Lippmann warned, "could 
. . . lead to an intellectual caste system in which the 
task of education had given way to the doctrine of 
predestination and infant damnation."^ 

College Admissions Standards: 
Pressure Mounts 

The admissions procedures established by the 
College Board had some clearly beneficial effects on 
education. They succeeded in enforcing some de- 
gree of uniformity in the college admissions process, 
helped raise the level of secondary school instruction, 
engendered serious discussion about the appropriate 
curriculum for college-bound youth, and built 
solid, cooperative relationships among higher edu- 
cation institutions throughout the country. 

Nevertheless, several influential colleges continued 
to express concern that most secondary schools 
did not take the mission of college preparation 
seriously and did not organize their curricula within 
the College Board's guidelines. Moreover, despite 
the board's energetic efforts at standardization, a 
large portion of the Nation's colleges continued to 
rely to some extent on their own examinations.^^ 

In addition, college leaders were coming to a more 
sophisticated recognition of the limitations of achievement-type 
tests, including the College Board tests, in 
helping admissions officers discriminate between 
students who had stockpiled memorized knowledge 
and students with more general intellectual ability. 
Harvard was particularly sensitive to the apparently 
high number of applicants who, ". . . as a result of 
constant and systematic cramming for examinations 
. . . manage to gain admission without having 
developed any considerable degree of intellectual 
power." Partly in response to this problem, 
Harvard developed a plan that in a fundamental way 
presaged the eventual swing from curriculum-centered 
achievement tests toward more generalized 
tests of intellectual ability: the plan called for a shift 
from separate subject examinations to "comprehensive" 
examinations designed to measure the ability 
to synthesize and creatively interpret factual knowl- 
edge. 

At Columbia University, as well, the pressure was 
on to do something about the admissions process. 
The arrival of increasing numbers of immigrants, 
many of them Eastern European Jews living in New 
York City, fueled the xenophobia. Columbia's 
President, Nicholas Butler, for example, found the 
quality of the incoming students (in 1917) "depressing 
in the extreme . . . largely made up of 
foreign born and children of those but recently 
arrived. . . ." To counteract this trend, Butler 
adopted the Thorndike Tests for Mental Alertness, 
hoping that it ". . . would limit the number of Jewish 
students without a formal policy of restriction."^^ 

In 1916, the College Board began developing 
comprehensive examinations in six subjects. These 
examinations included performance types of assess- 
ment such as essay questions, sight translation of 
foreign languages, and written compositions. While 
the comprehensive examinations enabled colleges to 
widen the range of applicants, university leaders 
continued to watch with interest the development 
and growing acceptance of intelligence tests. 

Responding to the demand for standardization 
and for tests that could sort out applicants qualified 
for college-level work from those less qualified, the 
College Board developed the Scholastic Aptitude 
Test (SAT). The test was administered for the first 
time in 1926; one-third of the candidates who sat for 
College Board examinations took the new test, and 
the SAT was off to a promising start.^^^ 

In addition to reinforcing the growing popularity 
of multiple-choice items, the SAT made several 
other contributions to the testing enterprise. First, 
the College Board took pains to try to prevent 
misinterpretation of SAT results. The board's manual 
for admissions officers cautioned that the new 
tests could not predict the subsequent performance 
of students with certainty and further warned of the 
pitfalls of placing too much emphasis on scores. 
Second, the board also adopted procedures from the 
outset to ensure confidentiality of test scores and 
examination content.^^ Third, the unique scoring 
scale, from 200 to 800, with 500 representing the 
average, indicated where students stood relative to 
others, a concept that helped lay the underpinnings 
for the eventual dominance of norm-referenced 
testing. 

Given the central role of colleges and universities 
in American life generally and their specific influ- 
ence on secondary education standards, it is perhaps 
not surprising that examinations designed for selec- 
tion soon became the basis for rather general 
judgments about individuals' ability and achieve- 
ment, or that in later years, the SAT would become 
the basis even for inter-State comparisons of school 
systems. Clearly the SAT was not designed or 
validated for either of those purposes,^^ as its 
designers have attempted to clarify time and again; 
the fact that it was appropriated to those ends, 
therefore, stands out as a warning of how tests can be 
misused. 

Testing and Survey Research 

Along with the increased use of standardized tests 
for tracking in the elementary and secondary grades 
and for college admissions, the period between the 
wars also saw the first uses of standardized tests in 
large-scale school surveys. These studies, which 
paved the way for the kinds of program evaluations 
that would become so important in education policy 
analysis in the 1960s, had several aims. Researchers, 
journalists, and charitable foundations seized on 
surveys as a way of calling attention to inequities 
and shortcomings in public education. Understand- 
ably, these studies met resistance from school 
superintendents, who resented being called on the 
carpet by outsiders. But as the old guard of 
superintendents were gradually replaced by people 
more familiar with the role of quantitative analysis 
in educational reform, and as superintendents came 
to see the benefits of an outside inventory of school 
needs, particularly in terms of increased public 
support for more funding, attitudes softened. 

The links between achievement test scores and 
later college performance were further challenged 
by Ralph Tyler's analysis of data generated in the 
"Eight-Year Study" (1932 to 1940).^ In looking 
for evidence of a link between formal college-preparatory 
work in high school and eventual 
college performance, Tyler reached several important 
conclusions. First, his research revealed that 
certain basic tenets of the progressive movement, 
e.g., deemphasizing rigid college entrance require- 
ments in the high school curriculum, did not produce 
graduates who were less well prepared for college 
work than those in traditional classrooms. Second, 
Tyler's research ". . . confirmed the importance of 
following student progress on a continuous basis, 
recording data from standardized tests as well as 
other kinds of achievement."^ Third, it set an 
important precedent for the use of achievement 
scores as a control variable in large-scale survey-based 
studies. Finally, the study demonstrated the 



Ibid., pp. 31-37. 

The Scholastic Aptitude Test is intended as a source of additional information, over and above high school grades, to predict freshman grade point 
average. While its predictive validity has been documented, even that rather modest mission (as compared with overall judgments of individual ability 
or State education systems) is controversial. See, for example, Crouse and Trusheim, op. cit., footnote 104. 

Tyack and Hansot, op. cit., footnote 9, p. 163. 

The study involved a group of 30 public and private secondary schools, which had been invited to revise substantially their course offerings and 
provide a more flexible learning environment for students intending to go to college. Cooperating with these 30 schools were some 300 colleges and 
universities that had agreed to waive their formal admissions requirements. Tyler examined the effects of high school work on college performance among 
1,475 pairs of students, each consisting of a graduate of one of the 30 schools and a graduate of another school not in the study, matched as closely 
as possible on race, sex, age, aptitude test scores, and background variables. 

Resnick, op. cit., footnote 3, p. 186. 

potential power of educational research as an agent 
of change.^ 

Another development in the years between the 
wars was high-speed computing, first applied to 
testing in 1935. Although there was by then little 
argument with the idea of standardized testing, the 
cost-effectiveness of using electronic data process- 
ing equipment to process massive numbers of tests 
was icing on the cake. One report showed that the 
cost of administering the Strong Inventory of 
Vocational Interests dropped from $5 per test to $0.50 
per test as a result of the computer.^^ 

Testing and World War II 

Once again, new research ground was broken on 
the eve of world war. But unlike the experience with 
the Army Alpha program in World War I, the testing 
that took place during the second World War did not 
substantially affect educational testing; nor did it 
engender much public controversy. For one thing, 
testing was already so well ensconced in the public 
mind (several million standardized tests were administered 
annually by the outbreak of the war) that 
the testing of 10 million Army recruits hardly 
seemed out of the ordinary. Second, the Army 
testing program did not focus on innate ability and 
the hereditarian issue. And third, it did not seem to 
rest on assumptions of a unitary dimension of 
intelligence. Rather, it seems that the theoretical and 
empirical studies initiated by Thurstone, Lindquist, 
and others had succeeded in persuading the Army 
psychologists to consider alternative models with 
which to estimate soldiers' abilities and future 
performance. 

"Multiple assessment," which examined distinct 
mental abilities, such as verbal comprehension, 
word fluency, number facility, spatial visualization, 
associative memory, perceptual speed, and reason- 
ing, was one of two significant technological devel- 
opments in testing during this period.^^^ Another 
was the transfer of testing technology from the 
schools to the military. For example, elements of the 
Iowa Tests of Basic Skills and the Iowa Test of 
Educational Development were borrowed by the 
Army for their World War II testing program, 
establishing the credibility of tests based on notions 
of multiple dimensions of ability. 

Equality, Fairness, and Technological 
Competitiveness: 1945 to 1969 

Overview 

Much of the controversy over student testing 
during the post-World War II period revolved 
around its uses in classification and selection. 
Although there had always been some dissent, 
controversy over student testing had entered a 
relatively quiet phase in the late 1920s, allowing the 
psychometric community to refine its craft and the 
educational community to create ". . . the most 
tested generation of youngsters in history."^^^ But 
astute listeners in the early post-war years could 
detect faint rumblings of conflict; by the end of the 
1960s testing would once again be in the eye of the 
storm over educational and social policy. 

Three sets of forces came to bear on the schools 
in general and on testing policy in particular during 
the 1950s and 1960s: demographic change, due 
largely to new immigration, which once again 
challenged the American ideal of progressive education; 
technological change, brought into sharp relief 
by the launching of Sputnik, which ignited nation- 



Commenting on the Eight-Year Study, Lee Cronbach and Patrick Suppes wrote: 

Although the study was carried out as planned, one cannot escape the impression that the central question was of minor interest to the investigators 
and the educational community. The main contribution of the study was to encourage the experimental schools to explore new teaching and 
counseling procedures. 

Lee Cronbach and Patrick Suppes (eds.), Research for Tomorrow's Schools: Disciplined Inquiry for Education (New York, NY: MacMillan Publishing 
Co., 1969), pp. 66-67. George Madaus (personal communication, 1991) notes that the Eight-Year Study was a turning point in the design of tests: it 
supported Tyler's argument that direct measures of performance needed to precede the design of indirect measures. See also G. Madaus and D. 
Stufflebeam (eds.), Educational Evaluation: Classical Works of Ralph W. Tyler (Boston, MA: Kluwer, 1989). 

Resnick, op. cit., footnote 3, p. 190. For more discussion of the technology of testing see ch. 8. 

To this day, the debate between the unitary and multidimensional intelligence theorists rests in stalemate, largely because each camp uses different 
mathematical models to analyze test scores. As Howard Gardner has neatly pointed out: "Given the same set of data, it is possible, using one set of 
factor-analytic procedures, to come up with a picture that supports the idea of a 'g' factor; using another equally valid method of statistical analysis it 
is possible to support the notion of a family of relatively discrete mental abilities." Howard Gardner, Frames of Mind, 2nd ed. (New York, NY: Basic 
Books, 1985), p. 17, and ch. 6 of this report. 

Cremin, op. cit., footnote 2, p. 192. Daniel and Lauren Resnick would later embellish this theme, arguing that "American children were the most 
tested in the world, and the least examined." See Daniel P. Resnick and Lauren Resnick, "Standards, Curriculum and Performance: A Historical 
Perspective," Educational Researcher, vol. 14, No. 4, April 1985, p. 17. 

Photo credit: Marjory Collins 

Testing of children has often involved oral as well as written 
work. These first grade pupils at the Lincoln School of 
Teachers' College, Columbia University, are recording 
their voices for diction correction, circa 1942. 

wide interest in science and mathematics education 
as well as higher standards of schooling overall; and 
the awakening of the public conscience to the 
problems of racial inequality in the Nation's public 
schools, which led to wholly new approaches to 
school governance, financing, and participation. 

Access Expands 

Enrollment in public elementary and secondary 
schools jumped from 25 million in 1949-50 to 46 
million in 1969-70, or from 17 percent of the total 
population to over 22 percent. The number of high 
school graduates went from just over 1 million in 
1950 to 2.6 million in 1970. The trend was even 
more impressive in the postsecondary sector: total 
enrollments in institutions of higher education went 
from 2.6 million in 1949-50 to 8 million in 1969-70. 
While part of the enrollment growth is explained by 
the size of the "baby boom" cohort, the increase in 
the proportion of the population enrolled in school 
signifies progress toward the goal of universal 
access. 

The timing of this upsurge in participation sug- 
gests that through decades of increased reliance on 
standardized tests, the progressive spirit in Ameri- 
can education had not only survived, but had 
actually flourished. Several points need to be made in 
this regard. First, recall that student classification 
had been viewed by the early progressives as a 
means to render schooling more efficient: it was 
when tests became designed and used to classify 
students on the basis of innate ability (and to 
allocate educational resources accordingly) that 
some of the Progressives began to protest. Although 
the proponents of testing could argue that their 
approach was intended to ensure continued high 
standards of school quality, the resulting sorting and 
tracking of children was anathema to many leaders 
of the Progressive movement (Dewey, in particular).^ 

Second, both sides claimed to have the welfare of 
children and the Nation at heart. It was commonly 
agreed that schooling needed to improve; the dispute 
arose over the choice of strategy. One side favored 
increased access to education by all students, and 
tolerated or supported testing as a way to manage 
massive public education more efficiently. The 
implicit assumption was an egalitarian one: all 
children could learn. The other side also favored 
testing; but the underlying assumption was that 
some children were innately more capable of learning 
than others, and that classification would keep 
standards high for the more able students while 
sparing the slower ones the embarrassment of 
failure.^ 

On the acceptability of testing by the Progressive movement, see also Cronbach, op. cit., footnote 40, p. 8. While Cronbach concedes that the testers 
themselves may have gone too far in their reliance on the new science of measurement, he seems to place more of the blame for controversy on the 
press: "Virtually everyone favored testing in [schools]; the controversies arose because of incautious interpretations made by the testers and, even more, 
by popular writers." 

The Test of General Educational Development 
(GED) played an interesting role in expanding 
educational access. The GED was formulated by the 
U.S. Armed Forces Institute, in cooperation with the 
American Council on Education, to address the 
problems of returning service personnel who had 
been inducted before graduating from high school. 
Patterned after the Iowa Test of Educational Development 
and constructed with substantial input from 
Lindquist, the GED was intended to enable out-of-school 
youth and adults to demonstrate knowledge 
for which they would receive academic credit and in 
some cases a high school equivalency diploma.^ 

Thus, the postwar enrollment boom and the 
development of the GED could be viewed as a 
victory for universal access. But the analysis would 
be remiss without repeating the obvious: these 
developments took place in an education culture 
fully infused with standardized tests. Indeed, it 
would be possible to argue (as some did) that tests 
opened gates of opportunity, that access to school 
was enhanced, not encumbered, by objective tests.^ 
In later years this theme would be echoed by some 
minority leaders, who argued that standardized tests 
allowed children the opportunity to demonstrate 
their ability more effectively — and more fairly — 
than they had been able to in the highly subjective 



environments of their impoverished classrooms.^^' 
This curious nature of testing (it could be assigned 
responsibility for enhancing or for confining opportunities 
for advancement) sheds light on its powerfully 
symbolic role in American society generally 
and in education specifically. 

Developments in Technology 

American enchantment with technology during 
the 1950s produced several strides in the field of 
testing. Most noteworthy was the automatic scoring 
machine, a form of optical scanner invented by the 
Iowa Testing Program. The machine enabled tests to 
be processed in large volume and at a reasonable 
cost.^^ During the next 12 years, the Iowa program, 
through its engineering spinoff, the Measurement 
Research Center, perfected several generations of 
scanners, each smaller but more powerful than the 
last.^^ 

With this equipment, national testing programs 
became feasible. Although the optical scanning 
equipment did not in itself drive up demand for 
testing, it gave an efficiency edge to tests that could 
be scored by machine and enabled school systems to 
implement testing programs on a scale that had 
previously been unthinkable. An enormous jump in 
testing ensued. One estimate of the number of 
commercially published tests administered in 1961 



The tension between access and standards has been a longstanding motif in education policy debates. Lawrence Cremin illustrates it eloquently in 
his summary of former Harvard President James Conant's conflicted views on the subject: 

For Conant ... the mixing of youngsters from different social backgrounds with different vocational goals in comprehensive high schools is 
important to the continued cohesiveness and classlessness of American society, important enough to maintain in the face of the difficulty of 
providing a worthy education to the academically talented in the context of that mixing. Hence, the central problem for American education is 
how to preserve the quality of the education of the academically talented in comprehensive high schools. 

Cremin, op. cit., footnote 2, p. 23. 

Peterson, op. cit., footnote 94, p. 82. 

Christopher Jencks and David Riesman argue that the "conservatives" in the debate over college admissions policies were those who disliked tests 
and who preferred the old-fashioned criteria (e.g., that sons of alumni should be granted preference); and that the "liberals" were those who favored 
". . . seeking out the ablest students . . . wherever they might come from." They go on to suggest that while the liberals appear to have been winning, 
there has been a ". . . rising crescendo of protest, especially from the civil rights [. . .] 
the use of tests to select students and allocate academic resources." See their seminal work, The Academic Revolution (New York, NY: Doubleday, 1969), 
pp. 121 ff. 

See, e.g., Donald Stewart, "Thinking the Unthinkable: Standardized Testing and the Future of American Education," speech before the Columbus 
Metropolitan Club, Columbus, OH, Feb. 22, 1989. Stewart, who is president of the College Entrance Examination Board, notes that: 

In a country as multicentric and pluralistic as ours, only a standardized test that works like the SAT is going to be valuable . . . in prov 

. . . national sense of the levels of educational ability of different individuals and also different groups. 
He goes on to note that: 

... the SAT has made it possible for students from every background and geographic origin to attend even the most prestigious institutions. 
Peterson, op. cit., footnote 94, p. 89. 

was 100 million,^^^ or just under 3 tests per year, on 
average, for each student enrolled in grades K-12.^^ 

In 1958, Iowa also introduced computerization to 
the scoring of tests and production of reports to 
schools. This early and rather primitive application 
of computers to the field of testing helped propel two 
decades of research and development that culmi- 
nated in highly sophisticated programs of computer- 
based testing. 

But technology played an important role not just 
in the design and implementation of tests, but as a 
catalyst to renewed interest in the use of testing to 
improve education. By the mid-1950s, a major 
expansion in educational opportunities was taking 
place amid a continued reliance on standardized 
tests to diagnose and classify students and monitor 
school quality. The impetus for this expansion came 
in large part from America's rude awakening to 
global technological advance: the Soviet launching 
of Sputnik (Oct. 4, 1957) spurred many Americans 
to question whether the battlefield victories in World 
War II were sufficient for America to win the peace 
that followed. As in prior periods of perceived 
external challenge, the policy response centered on 
education, and as in prior periods, the education 
reforms involved increased testing. The general idea 
behind the National Defense Education Act of 1958 
was to provide Federal funds for upgrading mathe- 
matics and science education in particular. 

One means for accomplishing this goal was the 
allocation of Federal dollars to support the develop- 
ment and maintenance of: 

... a program for testing aptitudes and abilities of 
students in public secondary schools, and ... to 
identify students with outstanding aptitudes and 
abilities ... to provide such information about the 
aptitudes and abilities of secondary school students 
as may be needed by secondary school guidance 
personnel in carrying out their duties; and to provide 
information to other educational institutions relative 
to the educational potential of students seeking 
admissions to such institutions. . . .^^ 



Race and Educational Opportunity 

The birth of the modern civil rights movement
was a watershed in American history and marked a 
turning point in the history of schooling. It also 
altered the course of testing policy and raised new 
debates about the design and use of various tests in 
school and the workplace. 

In 1954, the Brown v. Board of Education 
Supreme Court decision ruled out racial segregation
in schools, thereby establishing the legal prescrip-
tion for completing the mission of the public school
movement. It had taken about 100 years to address 
this glaring anomaly in a school system predicated 
on the ideal of universal access. Brown had no 
immediate and direct consequences for testing, but 
it set in motion social and ideological forces that 
would, in years to come, bring student testing into 
new arenas of controversy and, for the first time, into 
the courts. 

In a second significant court case, Hobson v. 
Hansen (1967), filed on behalf of a group of Black 
students in Washington, DC, the policy of using tests 
to assign students to tracks was challenged on the 
grounds that it was racially biased. The judge 
concurred; although the test was given to all 
students, the court found that because the test was 
standardized to a white, middle class group, it was 
inappropriate to use for tracking decisions.^125

The explicit rejection of the notion of "separate
but equal" in Brown set the tone for challenges such
as Hobson, which found that tests used for classifica-
tion could result in the kinds of racially segregated
classrooms (or schools) explicitly outlawed by
Brown. A new branch of applied statistics emerged,
concerned with the analysis of group differences in
test scores in order to determine the potential
"adverse impact" of test use in certain kinds of
decisions. 



122 David Goslin, The Search for Ability (New York, NY: Russell Sage, 1963).

123 Total K-12 enrollments in the 1959-60 school year were just over 36 million. See U.S. Department of Education, Digest of Education Statistics,
1990 (Washington, DC: U.S. Government Printing Office, 1991), p. 47.

124 National Defense Education Act, Public Law 85-864.

125 269 F.Supp. 401 (D.D.C. 1967).
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Controversies emerged over the effects of tests in 
correcting or exacerbating racial inequality.^126 Two
other points need to be made about this period. First, 
the civil rights movement led to the development of 
a wide range of social programs, which in turn 
created new demands for accountability measures to 
ensure that Federal money was being well spent. A 
century after accountability became a purpose of 
student testing at the State and local level, the model
was being applied on a grand scale to national issues. 
The 1965 Elementary and Secondary Education Act 
in particular opened the way for new and increased 
uses of norm-referenced tests to evaluate programs. 

Second, controversy over the quite obvious in- 
creased reliance on testing for selection and monitor- 
ing decisions did not abate; on the contrary, even the 
notion of using certain kinds of ability tests to 
classify children into categories such as "educably
mentally retarded," for the purpose of giving them
special educational treatment, came under strident 
criticism by parents and leaders who viewed the 
classification as potentially harmful to their chil- 
dren's long-term opportunities. 

Recapitulation 

Testing of students in the United States is now 150
years old. From its earliest incarnation coinciding 
with the birth of mass popular schooling, testing has 
played a pivotal role in the American experiment 
with democratic education. That experiment has
been unique in many ways. Not only did it begin 
well before most other industrialized countries 
expanded schooling to the masses, but it was carried 
out in a uniquely American, decentralized system: 
today 40 million children attend schools scattered 
across some 15,000 local school districts. If there 
have been taboos in American education, they have 
concerned national curriculum, national standards, 
and national testing.^127

Yet for all its diversity, the American system also
shows some remarkable uniformity and stability. 



Beneath the surface of institutional independence 
lies a strong unifying force, a tacit agreement that a 
principal objective of schooling is community: "E
pluribus unum" does not stop at the schoolhouse 
door. But neither does it come with a handy recipe 
to make it work. Indeed, the apparently endless 
struggle over the structure, content, and quality of
American education — and of educational tests — 
stems in part from the tension between the judg- 
ments of teachers, parents, and students on the one 
hand, and the quest for community, State, or even
national standards, on the other. 

Teachers in their classrooms have always used all
kinds of tests— everything from spot quizzes to 
group projects — as part of the continuous process of 
assessment of individual student learning. At the 
same time, as this chapter has shown, standardized 
examinations have been used at least since the 
mid-19th century to keep district and State education 
authorities, and the legislatures that fund them,
informed about the general quality of schools and 
schooling. From their inception, these tests have 
been used to inform institutional decisions about 
student placement and resource allocation, and they 
have been seen as a way to influence teaching and 
learning standards. 

Today the United States stands again at the
crossroads of major transition in student testing. The
issues framing today's public policy debate
(perceived decline in academic standards, shifts in
the demographic composition of the student popula-
tion, heightened awareness of global technological
competition, and lingering inequality in the alloca-
tion of educational and economic opportunities)
have been evolving for two centuries. Lessons from
the history of educational testing provide important
background to the development of testing policies
for the future.



126 The most vehement debate was sparked by the 1969 publication of an article by Arthur Jensen questioning whether school intervention programs
(such as Head Start) could affect IQ, which was largely determined by heredity. See Arthur Jensen, "How Much Can We Boost IQ and Achievement?"
Harvard Educational Review, vol. 39, winter 1969. For review of this controversy see, e.g., Cronbach, op. cit., footnote 40; Mark Snyderman and Stanley
Rothman, The IQ Controversy: The Media and Public Policy (New Brunswick, NJ: Transaction Books, 1988); and Fancher, op. cit., footnote 43.
127 This picture is changing. See discussion in chs. 1 and 2 of this report.
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CHAPTER 5 

How Other Countries Test^1



Highlights 

• There are fundamental differences in the history, purposes, and organization of schooling between the
United States and other industrialized nations. Comparisons between testing in the United States and
in other countries should be made prudently.

• The primary purpose of testing in Europe and Asia is to control the flow of young people into a limited
number of places on the educational pyramid. Although many countries have recently implemented
reforms designed to make schooling available to greater proportions of their populations, testing has
remained a powerful gateway to future opportunity.

• No country that OTA studied has a single, centrally administered test used for the multiple functions
of testing.

• Standardized national examinations before age 16 have all but disappeared from Europe and Asia. The
United States is unique in its extensive use of examinations for young children.

• Only Japan uses multiple-choice tests as extensively as the United States. In most European countries,
students are required to write essays "on demand."

• Standardized tests in other countries are much more closely tied to school syllabi and curricula than
in the United States.

• Commercial test publishers play a much more influential role in the United States than in any other
country. In Europe and Asia, tests are usually established, administered, and scored by ministries of
education.

• Testing policies in almost every industrialized country are in flux. The form, content, and style of
examinations vary widely across nations, and have changed in recent years.

• Teachers have considerably greater responsibility for development, administration, and scoring of tests
in Europe and Asia than in the United States.



International comparisons of student test scores 
have become central to the debate over reform of 
American education. Reports suggesting that Amer- 
ican students rank relatively low compared to their 
European and Asian peers, especially in mathemat- 
ics and science, have coincided with growing fears 
of permanent erosion in America's economic com-
petitiveness, and have become powerful weapons in
the hands of school reformers of nearly every 
ideological stripe. 

A recent addition to this arsenal of comparative 
education politics is the examination system itself: 
many education policy analysts in the United States 
who envy the academic performance of students in 
Europe and Asia also envy the structure, content, and
administration of the examinations those children
take. In the current debate over U.S. testing reform
options, it is common to hear rhetoric about the 
advantages of national examinations in other indus- 
trialized countries; some commentators have gone 
so far as to suggest that tougher examinations in the 
United States, modeled after those in other coun- 
tries, could motivate greater diligence among stu- 
dents and teachers and alter our slipping global 
competitiveness.^2

But these arguments are based on an exaggerated 
sense of the role of schools in explaining broad 
economic conditions, and on misplaced optimism 
about the effects of more difficult tests on improving



1 Material in this chapter draws extensively on the OTA contractor report by George F. Madaus, Boston College, and Thomas Kellaghan, St. Patrick's
College, Dublin, "Examination Systems in the European Community: Implications for a National Examination System in the United States," April
1991.

2 See, e.g., Robert Samuelson, "The School Reform Fraud," The Washington Post, June 19, 1991, p. A19.
education.^3 The rhetoric that advocates national
testing using the European model tends to neglect 
differences in the history and cultures of European 
and Asian countries, the complexities of their
respective testing systems, and the fact that their
education and testing policies have changed signifi- 
cantly in recent years. 

Explaining international differences in test scores 
is a delicate business.^4 Similarly, drawing inferences
from other countries' testing policies requires atten-
tion to the educational and social environments in
which those tests operate. As a backdrop to the
analysis in this chapter, it is important to keep in
mind some basic issues affecting the usefulness of
international comparisons of examination practices.

• Testing policies are in transition in most 
industrialized countries, where the pressures of 
a changing global economy have a ripple effect 
on public perceptions of the adequacy of 
schooling. 

• Parents in Europe and Asia, like their counter- 
parts in the United States, tend to praise their 
own children's schools while decrying the 
decline in standards and quality overall.^5

• There is considerable variation in the structures 
and conduct of school systems within Europe 
and Asia. For example, there is probably as 
much difference in the degree of centralization 
of curriculum between Germany and France as 
there is between France and the United States. 
These differences are reflected in testing poli-
cies that vary from country to country in
important ways. In Australia, Germany, Can-
ada, or Switzerland, for example, provincial (or
State) governments have considerably more
autonomy in the design and administration of
tests than in France, Italy, Sweden, or Israel.
Test format differs too: Japan relies heavily on
multiple choice and Germany still uses oral
examinations, while in most other countries the
dominant form is "essay on demand."

• The functions of testing have different histori- 
cal roots in Europe and Asia than in the United 
States. Steeped in the traditions of Thomas 
Jefferson, Horace Mann, and John Dewey, the 
American school system has been viewed as the 
public thoroughfare on which all children
journey toward productive adulthood. Univer-
sal access came relatively later in Europe and
Asia, where opportunities for schooling have 
traditionally been rationed more selectively 
and where the benefits of schooling have been 
bestowed on a smaller proportion of the popula- 
tion. Although recent reforms in many Euro- 
pean countries have opened doors to greater 
proportions of children, the role of tests has 
remained principally one of "gatekeeper,"
especially at the transition from high school to
postsecondary.^6 In this country higher educa-
tion is available to a greater proportion of
college-age children than in any other industri-
alized country.

• There is considerable variation among Euro- 
pean and Asian countries with respect to both 
the age at which key decisions are made and the 
permanence of those decisions. For example, 
second chances are more likely in the United 
States and Sweden than in most other countries, 
which do not provide many options for students 



3 See, e.g., Clark Kerr, "Is Education Really All That Guilty?" Education Week, vol. 10, No. 3, Feb. 27, 1991, p. 30; Lawrence Cremin, Popular
Education and Its Discontents (New York, NY: Harper and Row, 1990); and Richard Murnane, "Education and the Productivity of the Work Force:
Looking Ahead," American Living Standards, Robert E. Litan, Robert Z. Lawrence, and Charles L. Schultze (eds.) (Washington, DC: Brookings
Institution, 1988), pp. 215-246.

4 See Iris Rotberg, "I Never Promised You First Place," Phi Delta Kappan, vol. 72, No. 4, December 1990; and the rejoinder by Norman Bradburn,
Edward Haertel, John Schwille, and Judith Torney-Purta, Phi Delta Kappan, vol. 72, No. 10, June 1991, pp. 774-777. For discussion of how American
postsecondary education ought to be factored into international comparisons, see Michael Kirst, "The Need to Broaden Our Perspectives Concerning
America's Educational Attainment," Phi Delta Kappan, vol. 73, No. 2, October 1991, pp. 118-120.

5 James Irving, director of Learning and Assessment Policy Division, New Zealand Ministry of Education, personal communication, February 1990.
For the United States, the latest Gallup poll shows ratings of public schools have remained basically stable since 1984. The most striking aspects are
the higher ratings the public in general give their local schools (42 percent [...]
schools overall (only 21 percent rate them an "A" or "B"). Most significant, however, [...]
give to the schools their own children attend (73 percent rate these schools an "A" or "B"). It is suggested that the more firsthand knowledge one has
about the public schools, the more favorable one's perception of them. Stanley M. Elam, Lowell C. Rose, and Alec M. Gallup, "The 23rd Annual Gallup
Poll of the Public's Attitudes Toward the Public Schools," Phi Delta Kappan, vol. 73, No. 1, September 1991, p. 54.

6 See Max A. Eckstein and Harold J. Noah, "Forms and Functions of Secondary-School Leaving Examinations," Comparative Education Review,
vol. 33, No. 3, August 1989, p. 303. It is important to note that Japanese children enjoy considerably greater access to schooling than is commonly
believed. For a summary of myths and data regarding Japanese education, see William Cummings, "The American Perception of Japanese Education,"
Comparative Education, vol. 23, No. 3, September 1989, pp. 293-302.
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who bloom late or have not done well on tests. 
In Japan, children are put on a track early on:
the right junior high school leads to the right 
high school, which leads to the right university, 
which is the prerequisite for the best jobs. 
Japanese employment reflects the rigidity that 
begins with schooling: job mobility is negligible,
"career-switching" a totally alien concept.
Employment opportunities for French, Ger-
man, and British students are significantly
affected, albeit in varying degrees, by perform- 
ance on examinations. 

The purpose of this chapter is to consider lessons
for U.S. testing policy that can be drawn from the 
experiences of selected European and Asian coun- 
tries. The first section provides an overview of 
education and testing systems in the European 
Community (EC) and other selected countries. The
second considers lessons for U.S. testing policy. The
last section contains "snapshots" of examination
systems in selected countries. 

Teaching and Testing in the EC and
Other Selected Countries^7

Origins and Purpose of Examinations 

The university has always played a central role in 
examination systems in most European countries.^8
In France, for example, the Baccalaureat (or Bac)
was established by Napoleon in 1808 and has been
traced to the 13th century determinance, an oral
examination required for admission to the Sorbonne. 
The Bac was the passport to university entrance in 
France until recently, when additional admissions 
requirements were developed by the more prestig- 
ious schools. 

Universities also played an important role in the 
establishment of examinations in Britain. London 
created a matriculation examination in 1838, which 
in 1842 became the earliest formal written school 
examination.^9 The system established at the Society
of Arts, taken as an exemplar by other systems, was
modeled on the written and oral examinations used
at the University of Dublin. Oxford and Cambridge
established systems of "locals," examinations graded
by university "boards" to assess local school
quality. In 1858, they began to use these examina-
tions for individual students and, in 1877, to select
them for university entrance. Other universities 
(Dublin and Durham) followed the same path and 
established procedures for examining local school 
pupils. The system of university control of examina- 
tions continued throughout the second half of the 
19th century. 

During the 18th and 19th centuries European 
countries also began to develop examinations for 
selection into the professional civil service. The 
purposes of the examinations were to raise the
competency levels of public functionaries, lower the
costs of recruitment and turnover, and control
patronage and nepotism. Prussia began using exami-
nations for filling all government administrative
posts starting as early as 1748, and competition for
university entrance as a means to prepare for these
examinations followed. The British introduced com-
petitive examinations for all civil service appoint-
ments in 1872.

Public examination systems in Europe, therefore, 
developed primarily for selection, and when mass 
secondary schooling expanded following World 
War II, entrance examinations became the principal
selection tool setting students on their educational 
trajectories. In general, testing in Europe controlled 
the flow of young people into the varying kinds of 
schools that followed compulsory primary school- 
ing. Students who did well moved on to the 
academic track, where study of classical subjects led 
to a university education; others were channeled into 
vocational or trade schools. 

In the last two decades, the duration of compul-
sory schooling has become longer; the trend has



7 The 12 members of the European Community (EC) are Belgium, Denmark, France, Germany, Greece, Ireland, Italy, Luxembourg, The Netherlands,
Portugal, Spain, and the United Kingdom. Much of the general discussion of EC education and examination systems is taken from Madaus and Kellaghan,
op. cit., footnote 1. For comparative data on U.S. and Japanese education, see, e.g., Edward R. Beauchamp, "Reform Traditions in the United States
and Japan," Educational Policies in Crisis, William K. Cummings, Edward Beauchamp, Shogo Ichikawa, Victor N. Kobayashi, and Morikazu Ushiogi
(eds.) (New York, NY: Praeger Publishers, 1986).

8 In the United States, secondary schooling is more closely linked, in structure and content, with primary than with university education. Other
countries' elite secondary schools are closely linked to universities. See Martin Trow, "The State of Higher Education in the United States," in
Cummings et al., op. cit., footnote 7, p. 177.

9 Some professional bodies had already introduced written qualifying examinations (Society of Apothecaries in 1815 and Solicitors in 1835). The
London examination initiated in 1842 was the first formal school examination of its kind.
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generally been to provide access to comprehensive 
schooling for more students and to provide a wider 
variety of academic and vocational choices. Exami- 
nations that filter students into different kinds of 
schools, once given at the end of primary school 
(around age 11), now take place around age 16 or 
even 18. The uses and formats of these ''school- 
leaving'' examinations are evolving as more options 
have become available and larger percentages of 
students seek and can gain access to postsecondary 
education. In several countries, school-leaving ex-
aminations that were once considered a passport to
higher education have evolved into first stage or
qualifying examinations, which are followed by
more diversified examinations for specific prestig-
ious universities or lines of study administered by
the university itself. Examples are the French
Baccalaureat, the German Abitur, and the Japanese
Joint First Stage Achievement Test (JFSAT).^10

Standardized examinations are not generally used 
outside the United States for purposes other than
certification or selection. However, some exceptions
are noteworthy. In Sweden, standardized examina-
tions are used as scoring benchmarks to help
teachers grade students uniformly and properly in
their regular classes. Examination results in a few
countries serve not only to evaluate student perform-
ance but also to evaluate the quality of a teacher or
school. This was the approach, now abandoned, in
England during the second half of the 19th century,
when "payment by results" was based on student
scores.^11 Today student scores in China have taken
on this school accountability function,^12 in that "Key
Schools" in China receive extra resources in recog-
nition of their better examination results.



Central Curricula 

In most EC countries curriculum is prescribed by 
a central authority (usually the Ministry of Educa-
tion). However, the level of prescription varies from
system to system. In Germany, curricula are deter-
mined by each of the 11 States,^13 in France the
curriculum is quite uniform nationwide, and in 
Denmark individual schools enjoy considerable 
discretion in the definition of curricula. The trend in 
several countries has been to allow schools a greater 
say in the definition of curricula during the compul- 
sory period of schooling; school-based management 
and local control are not uniquely American con- 
cepts. 

The United Kingdom^14 seems to be moving in the
other direction. In the past, curricula in the United 
Kingdom were determined by the local education 
authorities and even individual schools. Independ-
ent regional examination boards exerted a strong
influence on the curricula of secondary schools. The 
central government significantly tightened its grip 
around the regional boards beginning in the mid- 
1980s, and since the Education Reform Act of 1988 
the U.K. has moved toward adoption of a common 
national curriculum. 

Divisions Between School Levels 

Most European countries have maintained the 
conventional division between primary, secondary, 
and third-level education. The primary sector offers 
free, compulsory, and common education to all 
students; the secondary level is usually divided into 
lower and upper levels. The duration of primary
schooling can vary among the States or provinces of 
a given country. 



10 This has changed slightly with the change from the Joint First Stage Achievement Test (JFSAT) to the Test of the National Center for University
Entrance Examinations (TNCUEE). The JFSAT was required only for those candidates applying to national and local public universities (approximately
49 percent of total 4-year university applicants), not those applying to private universities. Some applicants for private universities now also take the
TNCUEE. Shin'ichiro Horie, Press and Information Section, Embassy of Japan, personal communication, Aug. 2, 1991.

11 In 1862, the British Government adopted the Revised Code of 1862, which established the criteria for the award of government grants to elementary
schools. Each child of 6 and over was to be examined individually by one of Her Majesty's Inspectors toward the end of each school year. Attendance
records were also taken into consideration. Thus, each child over 6 could earn the school 4 shillings for regular attendance and a further 8 shillings for
successful performance in the annual examination. Clare Burstall, "The British Experience With National Educational Goals and Assessment," paper
presented at the Educational Testing Service Invitational Conference, New York, NY, October 1990.

12 Eckstein and Noah, op. cit., footnote 6, p. 307.

13 This is also the case in Canada and Australia, where each of the provinces or States sets its own curricula.

14 The term "United Kingdom" (England, Wales, Scotland, and Northern Ireland) is used throughout this document. Testing practice in Northern
Ireland, England, and Wales is similar, but Scotland is unique, with a completely different structure of testing and examinations. Scotland has only one
examining board, with close connections to the central Scottish Education Department; the other countries in the United Kingdom each have several
examining boards. Desmond Nuttall, director of the Centre for Educational Research, London School of Economics and Political Science, personal
communication, June 1991.
Table 5-1: Data on Compulsory School Attendance and Structure of the
Educational Systems in the European Community

a Belgium and Ireland have an additional 2 years preprimary education integrated into the primary school system. All
other countries have provision outside the formal educational system for early childhood education.

b Belgium and Germany are federations. There are two States in Belgium with completely independent educational
systems. There are 11 States in the former Federal Republic of Germany (16 in the new Germany). Each of the 11
States determines its curriculum under terms agreed by the Council of State Ministers of Education.

c A number of countries are less advanced than others in comprehensiveness of their school structures.

SOURCE: George F. Madaus, Boston College, and Thomas Kellaghan, St. Patrick's College, Dublin, "Examination
Systems in the European Community: Implications for a National Examination System in the United States,"
OTA contractor report, April 1991, table 3.



Most European countries at one time required a 
national school examination at the end of primary 
schooling. These examinations were intended to 
clarify for teachers the standards that were expected, 
provide a stimulus to pupils, and certify completion 
of a phase of formal education. They were used for 
admission to secondary education and for pre- 
employment screening. But these examinations 
raised many concerns about their limiting effects on 
the curriculum and about the tendency among some 
schools to retain students in grade in order to prevent 
the low achievers from presenting themselves for 
examinations. 

Perhaps most important, however, were the changes
in the philosophy of education that led to raising the 
school-leaving age and provision of adequate space 
in secondary schools to accommodate all students. 
Secondary education was once highly selective, with 
relatively low participation rates beyond the primary 
level, and with major divisions between two or three 
types of schooling. The most exclusive was the
"grammar school," "gymnasium," or "lycee,"
which prepared students for third-level education
and professional occupations. Typically, the school
systems of Europe offered a classical academic 
curriculum in the liberal arts. As numbers of students 
in this line of study grew, the traditional academic 
curriculum became diversified, subjects were pre- 
sented at different levels, and some students took 
practical or commercial-type subjects. 

After the second World War, and particularly 
during the 1960s, demographic, social, ideological, 
and economic pressures led to various reviews of 
education. All the EC countries have made some 
moves to provide comprehensive lower secondary 
education (up to age 15 or 16), but these patterns are 
varied (see table 5-1). Several countries have estab- 
lished comprehensive lower secondary school cur- 
ricula. Denmark and Britain have gone the furthest, 
with 10 years of comprehensive education. Greece, 
Portugal, Spain, Italy, and France also have rela- 
tively long periods of comprehensive education. 
There are some comprehensive schools in Germany 
but, on the whole, the German States have resisted 
the development of a thorough-going comprehen-
sive system. Both major components of the tradi-



15 The alternative to the academic secondary school were schools offering technical curricula to prepare students for skilled manual occupations. These
schools also expanded their range of offerings as the numbers of students grew, but they typically provided practical, usually short-term, continuing
education.
Table 5-2: Upper Secondary Students in General Education and in Technical/
Vocational Education, by Gender, 1985-86 (in percent)

                             Girls                           Boys
                  General    Technical/Vocational  General    Technical/Vocational
                  education  education             education  education

Belgium^          56%        44%                   53%        47%
Denmark           40         60                    26         74
France^b          65^c       35                    58^c       42
Germany^          51         49                    57         43
Greece            83         17                    62         38
Ireland           79         21                    86         14
Italy^            26         74^e                  22         78^e
Luxembourg        38         62                    29         71
Netherlands       49         51                    43         57
Portugal^f        99         1                     99.8       0.2
Spain             58         42                    53         47
United Kingdom    53         47                    57         43

a Lower and upper secondary education.
b 1986-87.
c Includes upper secondary technological education.
d 1984-85.
e Includes preschool and primary teacher training.
f Technical/Vocational education was abolished in 1976. New courses were introduced on an experimental basis in
1983/84.

SOURCE: European Communities Commission, Girls and Boys in Secondary and Higher Education (Brussels,
Belgium: 1990), table 3b.



tional German school structure (the classical gymna- 
sium and the vocational school) have been suffi- 
ciently strong and successful to resist possible 
merging. In particular, vocational education, often 
seen by students as more enticing than the gynmasiunt" 
Abitur-vrnversity route, has been consolidated and 
improved and is generally regarded as a success of 
educational policy.^16

Today the term "general education" is used to
describe the activities of schools that include university-
preparation curricula as well as programs designed
for students who are not likely to go on to university.
Nevertheless, the upper secondary level in all
European countries is still quite differentiated,
especially in Germany and Italy. (In Italy the system
is so complicated that it has been described as a
"jungle."^17) As shown in table 5-2, in 8 of the 12 EC
countries a majority of students follow a curriculum 
of general education, but a sizable number of 
students are in technical/vocational education courses. 
Comprehensive high schools in the United King- 



dom, France, and, to a somewhat lesser extent, 
Germany, have begun to resemble the typical 
comprehensive American high school. 

These shifts toward comprehensive schooling 
have resulted in changed testing policies. Today
none of the EC countries administers a national
examination at the end of primary schooling.^18

Variation in the Rigor and Content of 
Examinations 

Specified examinations for leaving secondary 
school and moving into higher levels of schooling 
vary across locales, kinds of degrees, subject areas, 
and competitiveness of the program or of the 
university. For example, while the French Bac
retains a large core of general education subjects that
all candidates are required to take (albeit with
different weights), the 4 options offered in 1950 had
grown to 53 in 1988.^19



16 Madaus and Kellaghan, op. cit., footnote 1, pp. 53-54.
17 Ibid., p. 55.

18 Ibid. Note, however, that Italy uses school-based primary examinations set, administered, and scored by the pupils' own teachers. The United
Kingdom has plans to introduce nationwide assessment at ages 7 and 11, but these will be scored by teachers and used for accountability, and are not
intended to be used for selection. Some schools in Belgium also administer an examination at the end of primary schooling, but this is a local school
option, not a national policy.

19 Information about the Bac was provided to OTA by Sylvie Auvillain of the French Embassy, July 1991. See also the final section in this chapter
for a more detailed discussion of the French examination system.
On the basis of examination performance, a
candidate is usually awarded a certificate or diploma
that contains information on performance on each
subject in the examination in letters (A, B, C, D, E)
or numbers (1, 2, 3, 4, 5). Usually, grades are
computed by summing marks on sections of ques-
tions and on clusters of questions or papers. The final
allocation of grades may also take into account grade
distributions in previous years. These marks or
grades are used in making university admissions 
decisions. 

The certificate or diploma may also confer the 
right to be considered for (if not actually admitted to)
some stratum of the social, professional, or educa-
tional world. Certificates are credentials, and certifi-
cation therefore plays a dual role: educationally, in
establishing standards of academic achievement,
and socially, in justifying the classification of
individuals into categories that determine their
shares of educational resources and employment
opportunities. 

Because government manages and finances higher 
education, and scholarships often cover almost all 
university costs in some countries, stiff entry compe- 
tition is seen as a fair and appropriate way to 
distribute scarce educational resources. 

Psychometric Issues 

Two major criteria for European examinations are 
objectivity and comparability. The central concern is
whether the examinations reflect what is in the
syllabus and whether they are scored fairly. Since, as
noted below, most of the examination questions are
essay questions that cannot be machine scored, it is
not surprising that these issues of fairness are
foremost. In the United States, test fairness issues
have been analyzed primarily through statistical
methods. This statistical apparatus, known as psy-
chometrics, has been honed over seven decades of
research and practice. It attempts to identify item or
test bias,^20 and determine the reliability and validity
of tests. Although European educators attempt to 
ensure that examinations reflect what is in the 
syllabus (i.e., content validity) and that they are
scored fairly (i.e., reliability), they do not typically
conduct intensive pretesting and item analysis; 



quantitative models of item-response theory, equat-
ing, reliability, and validity receive little or no
attention. Unlike the United States, Europe does not
have an elite psychometric community with strong
disciplinary roots, or an extensive commercial test
industry.^21 Only the United Kingdom has made any
attempt to apply to its examinations psychometric
principles of the type developed in the context of
U.S. testing, and they are still not in widespread use.

Essay Format and the Cost Question 

Because examinations in European countries 
require students to construct rather than select
answers, the examinations are considerably more
expensive to score than the multiple-choice tests
common in the United States. (Multiple-choice tests,
on the other hand, are relatively expensive to design. 
See ch. 6 for discussion.) In general, the more 
open-ended a test is, the more expensive it will be to 
score, since scoring requires labor-intensive human 
judgment as opposed to machine scoring. The 
achievement tests used in other countries typically 
assess mastery and understanding of a subject by 
asking students to write. A few require oral presenta-
tions (Germany, France, and foreign language exam-
inations in many countries). Some parts of the German
Abitur require students to give practical demonstra-
tions in subjects such as music and the natural
sciences.

These tests are expensive: grading them takes
the time of trained professionals (teachers, examin-
ers, university faculty, or some combination). For
example, written examinations taken at age 16+ in
Great Britain and Ireland cost roughly $110 per
student.^22 (In Ireland, candidates pay about 40
percent of the cost.) These costs may be tolerable in
countries where a small percentage of the age cohort
takes the examination. But in the United States, with
nearly five times as many students in this age group,
testing the 3 million 16-year-olds in U.S. schools
using the British or Irish model would cost about
$330 million. Looked at from the perspective of one
State, Massachusetts, it would cost almost $7
million to test all 65,000 16-year-old students using
the model of essay on demand; at present, Massa-
chusetts spends just $1.2 million to test reading,



20 For a recent summary and discussion of the meanings of test bias see, e.g., Walter Haney, Boston College, "Testing and Minorities," draft
monograph, January 1991. See ch. 6 for an explanation of reliability, validity, and other psychometric concepts.

21 Madaus and Kellaghan, op. cit., footnote 1, pp. 57-58.

22 Ibid., pp. 30-31.
writing, and arithmetic achievements of students in
three grades and three subjects.^23
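
The cost comparison above is simple multiplication, and a short sketch makes the arithmetic easy to check. The inputs are the report's own rounded estimates (roughly $110 per student, 3 million U.S. 16-year-olds, 65,000 in Massachusetts, a 40 percent candidate share in Ireland, and $1.2 million in current Massachusetts spending); the function and variable names are illustrative only. At exactly $110 per student the Massachusetts total works out to slightly more than $7 million, so the report's "almost $7 million" presumably reflects a somewhat lower unrounded unit cost.

```python
# Rough cost arithmetic using the report's rounded estimates.
# All names here are illustrative; none come from the report itself.

COST_PER_STUDENT = 110          # dollars, "roughly $110 per student"
IRISH_CANDIDATE_SHARE_PCT = 40  # percent of the cost paid by Irish candidates
MA_CURRENT_SPENDING = 1_200_000 # dollars Massachusetts spends at present


def total_cost(num_students, cost_per_student=COST_PER_STUDENT):
    """Total scoring cost in dollars for a cohort of num_students."""
    return num_students * cost_per_student


us_total = total_cost(3_000_000)   # all U.S. 16-year-olds
ma_total = total_cost(65_000)      # Massachusetts 16-year-olds
irish_fee = COST_PER_STUDENT * IRISH_CANDIDATE_SHARE_PCT // 100
ma_multiple = ma_total / MA_CURRENT_SPENDING

print(f"United States: ${us_total:,}")
print(f"Massachusetts: ${ma_total:,}")
print(f"Irish candidate's share per examination: ${irish_fee}")
print(f"Massachusetts cost as a multiple of current spending: {ma_multiple:.1f}x")
```

The last figure shows why the report treats the essay model as a budget question: under these assumptions the State would spend about six times what it spends now.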

An additional factor to be included in a cost 
analysis is the potential effect of tests on retention. 
In the United Kingdom, for example, many students
remain in school an extra year to repeat the General
Certificate of Secondary Education (GCSE) if they
did not pass the first time, or to repeat the more
advanced "A levels" if they wish to try for a higher
grade.

Tradition of Openness 

Individual test takers in the United States can 
request prior year examinations and sample exami- 
nation booklets for some tests used for selection, i.e., 
the Scholastic Aptitude Test (SAT); in addition,
third-party vendors offer test preparation classes or
software to enable students to practice for these
examinations. In general, however, there is a greater
emphasis on test security in the United States than in
other countries,^24 where both the examinations and
correct responses are made public following an
examination and become the subject of much
discussion. In France, for example, examination
questions make front page news, and in Germany,
answer scripts are returned to students, who may
question the way they were graded with their
teachers. If a problem cannot be resolved between
the student and teacher, the matter is referred to the
Ministry of Education. 

In the United States, legal challenges since 1980
have made college admissions tests
available to test takers who wish to review them, but
the examinations are not routinely publicized as in
Europe. Some observers contend that releasing
examination questions helps focus student and
teacher awareness on the facts, concepts, or skills
required in order to do well on the test, and that
"teaching to the test" is therefore a good thing.
Multiple-choice examinations, however, which are
quite inexpensive to score, are very costly to



develop, because of the time and effort spent 
pretesting items and attempting to eliminate various 
biases. Releasing such tests in advance, therefore, 
could jeopardize their validity; this is important 
because of the high costs of creating new items. 

The Changing State of Examinations in Most 
Industrialized Countries

There have been important changes in European 
test policies in the past three decades; many of the 
most dramatic changes have been undertaken in the 
last few years. France abolished centralized examina- 
tions at age 16+ with the aims of postponing 
selection, making assessment more comprehensive, 
and giving a greater role to teachers in assessing 
students. However, the examinations were reinsti-
tuted in the 1980s, at least partly because the
resources to support a school-based system of
assessment had not been made available to the
schools.^25 The United Kingdom is overhauling its
examination system. Even in Japan, where success
in examinations has been the central feature of the
educational experience, politicians and educators
are debating and reevaluating the form and functions 
of national examinations. 

A major force affecting examination policies has 
been expansion of the educational franchise. Rising
participation rates and rising expectations of indi-
viduals with diverse ethnic and socioeconomic
backgrounds have changed attitudes toward the
assessment of student progress and the uses of tests
for important economic and social decisions. Histor- 
ical criticisms of the narrowing effects of these 
examinations on students' educational experiences 
have become politically significant. Many commen- 
tators always judged tests unsuitable for low- 
achieving students, an argument that has gained 
credence in the light of data suggesting that in order 
to avoid the examinations these students are likely to 
leave school early and enter the labor force without



23 It should be noted that the United States has some experience with nationally standardized written examinations. The Advanced Placement (AP)
program, for instance, includes tests comprised of short answer and essay items. Currently the AP test costs $65 per subject per student, paid for in most
cases by the student rather than the school system. This financial burden prevents some poor students from taking the tests required for college credit.
Some States (Florida and South Carolina) pay all AP fees and others (Indiana and Utah) subsidize or help students in need, but most States have no
official policy, although the Educational Testing Service reduces the fee to $52 for those with need. Jay Mathews, "Low Income Pupils Find Exam Fees
a Real Test: California Questions Who Should Foot the Bill," The Washington Post, Apr. 25, 1991, p. A3.

24 Public Law 100-297, which authorizes the U.S. Secretary of Education to approve comprehensive tests of aca^
being conducted in a secure manner, "... the test items remain confidential so that such items may be used in future tests." This law has been passed,
but funding has not been appropriated.

25 Madaus and Kellaghan, op. cit., footnote 1, p. 60.
benefit of any formal certification.^26 The apparent
correlation between participation rates and school-
leaving examination policies is striking: in the
United Kingdom, for example, the participation rate
drops from almost 100 percent at age 15 to just under
70 percent at age 16, when examinations must be
taken. In contrast, some 95 percent of all American
16-year-olds are still in school (see table 5-3).

As noted above, a second area where examination 
policies have changed is the elimination of standard- 
ized examinations at the primary level. Furthermore,
at the secondary level there has been a move toward
greater reliance on assessments developed and
scored by teachers. In four EC countries (Belgium,
Greece, Portugal, and Spain), national examinations
have been abolished and certification is entirely 
school based at both primary and secondary levels. 
In other countries, teachers may mark examinations 
set by an outside body or contribute their own 
assessments, which are combined with the results of 
the standardized examinations. This was the pattern 
in Britain from the 1960s onward, and virtually 
every GCSE examination includes an assessment (of
things like oral work, projects, and portfolios) by
teachers. Although the national program is bringing
a more centralized curriculum to the United Kingdom,
the national curriculum assessment relies extremely
heavily on teacher assessments.^27

A third trend has been the shift in emphasis from 
selection to certification and guidance about future 
academic study. This shift has been made possible, 
especially at lower educational levels, by the expan- 
sion of places in secondary schools. Furthermore, as 
the examinations have become more varied, selec- 
tion for traditional third-level education is no longer 
a concern for as many students. Increasing numbers
are now turning to apprenticeships or technical 
training. 

Other Considerations 

There are other important variables that affect the 
administration, costs, and outcomes of testing. 
These include the numbers of students to be tested, 
preselection of students prior to testing, the homoge- 
neity of the student population and of the teaching 



Table 5-3: Enrollment Rates for Ages 15 to 18
in the European Community, Canada, Japan,
and the United States: 1987-88

                            Age 15   Age 16   Age 17   Age 18

Belgium                      95.8     95.5     92.7
  (of whom, part-time)       (2.2)    (3.6)    (4.6)
Denmark                      97.4     90.4     76.9     68.6
France                       95.4     88.2     79.3     63.1
  (of whom, part-time)       (0.3)    (7.9)   (10.0)
Germany(a)                  100.0     94.8     81.7     67.8
  (of whom, part-time)                         (0.1)
Greece                       82.1     76.2     55.2     43.6
Ireland                      95.5     83.9     66.4     39.6
Italy
Luxembourg                                     83.4     71.1
  (of whom, part-time)                        (15.8)   (15.8)
Netherlands                  98.5     93.4     79.2     59.7
Portugal                              32.1     36.9     29.2
Spain                        84.2     64.7     55.9     30.4
United Kingdom               99.7     69.3     52.1     33.1
Canada                       98.3     92.4     75.7     56.9
Japan                        96.6     91.7     89.3      3.2
  (of whom, part-time)       (2.6)    (1.9)    (1.7)    (1.4)
United States''              98.2     94.6     89.0     60.4

(a) Apprenticeship is classified as full-time education.
(b) 1986-87.
(c) Including third level.
(d) Excludes second level part-time education.

SOURCE: George F. Madaus, Boston College, and Thomas Kellaghan, St.
Patrick's College, Dublin, "Student Examination Systems in the
European Community: Lessons for the United States," OTA
contractor report, June 1991, table 5; information for this table
from Organisation for Economic Cooperation and Development,
Education in OECD Countries, 1987-88 (Paris, France: 1990),
table 42, except figures for Portugal, which are for secondary
education in 1983-84 and come from European Communities
Commission, Girls and Boys in Secondary and Higher Educa-
tion (Brussels, Belgium: 1990), table 1c.



profession, centralization and consistency of teacher 
training to support common standards, and the 
number of days in the school year. These issues need 
to be included in efforts to compare testing policies 
across countries. There is no one model that could be 
described as the European examination system and, 
more importantly, no one model that can be trans-
planted from its European or Asian setting and be
expected to thrive on American soil. 



Lessons for the United States 



What lessons from European and Asian testing 
policies apply to the American scene? To address 
that question OTA focused attention on three basic



26 In Britain and Ireland, the numbers of such students are about 11 and 8 percent, respectively. Ibid., p. 15. (This estimate appears low to other
researchers. Max Eckstein, professor of education, Queens College, City University of New York, personal communication, 1991.)

27 Nuttall, op. cit., footnote 14.
issues: the functions, format, and governance of
testing.^28

Functions of Testing 

This report concentrates on three basic functions 
of educational testing: instructional feedback to 
teachers and students, system monitoring, and selec- 
tion, placement, and certification (see ch. 1). Euro- 
pean and Asian testing systems, though different 
from country to country, tend to emphasize the last
group of functions, i.e., selection, placement, and
certification.^29 There is in other countries almost no
reliance on student tests for accountability or system
monitoring, activities that are typically handled
through various types of ministerial or provincial
inspectorates; this fact itself suggests an important 
lesson for U.S. educators. 

Selection, Placement, and Credentialing

If one wished to import testing practices from 
overseas, an obvious strategy would be to expand 
and intensify the use of student testing for selection, 
placement, and certification decisions. Indeed, this
appears to be at least one of the ideas behind some
proposals for national achievement testing in the
United States.^30 OTA finds that the European and
Asian experience with testing for these functions 
leads to three important lessons for U.S. poli- 
cymakers. 

First, in most other industrialized countries, the 
significance of testing is greatest at the transition 
from secondary to postsecondary schooling. Stand- 
ardized examinations before age 16 have all but 
disappeared from the EC countries. Primary certifi- 
cates used to select students for secondary schools 
have been dropped as comprehensive education past 
the primary level has become available to all 
students. Current proposals for testing all fourth 
graders with a common externally administered and 
graded examination would make the United States 



the only industrialized country to adopt this prac-
tice.^31

Second, the continued reliance on student testing 
as a basis for allocating scarce publicly funded 
postsecondary opportunities has, in Europe and 
Asia, come under intense criticism. Having rela- 
tively recently attempted to relax stringent ele-
mentary and secondary school tracking systems,
many countries have been reluctant to hold on to stiff 
examination-based criteria for admission to third- 
level schooling. As a result, admissions policies 
have been in flux. It would be ironic if U.S. 
policymakers, in an attempt to import the best 
features of other countries' models, adopted a 
system of increased selectivity, even at the post-
secondary level, just when those countries were
evolving in the other direction. 

In this context it is important to note the funda-
mental differences in the relationships between
secondary and postsecondary schooling in the United
States and elsewhere. In most other industrialized
countries, there is a strong link between secondary
schools and the universities for which they prepare
students; in the United States, on the other hand,
high school graduates face a vast array of postsec-
ondary opportunities, diverse in their location,
academic orientation, and selectivity. Although
periodically in American educational history there
have been attempts to influence secondary school
curricula and academic rigor through changes in
college admissions policies, the postsecondary sec-
tor in the United States has remained basically
independent of the system of primary and secondary 
public schools. Restructuring the linkages between 
these sectors along the lines of the European model, 
and changing the examination system accordingly, 
could bring about changes in the quality of Ameri- 
can high school education; but the benefits of such 
a policy need to be weighed against the uncertain 
effects it would have on the U.S. postsecondary 



28 This framework was suggested by Max Eckstein, professor of education, Queens College, City University of New York, who chaired an OTA
workshop on lessons from testing in other countries, January 1991.

29 Classroom testing, conducted by teachers to assess on a regular basis the progress of their students, is likely to be much the same around the
world: teacher-developed quizzes, end-of-year examinations, and graded assignments do not vary much from Stockholm to Sacramento, from Brussels
to Buffalo.

30 See, e.g., Madaus and Kellaghan, op. cit., footnote 1, for an overview of national testing proposals. It should be noted that many advocates of
high-stakes selection and certification tests view their principal role as stimulus to improved learning and teaching. Although this might be considered
a fourth function of testing, this report treats the potential motivating effects of tests as a crosscutting issue affecting the utility of tests designed to serve
any of the three main functions.

31 As discussed earlier, the United Kingdom has implemented a new system of national assessment at ages 7 and 11, for purposes of accountability
(system monitoring).






sector, considered by many to be the best in the
world.^32

The third lesson concerns the equity effects of 
increased testing for what are commonly called 
"gatekeeping" functions. Europe has a long history
of controlled mobility among nations, and an
equally long history of efforts to deal with the changing
ethnic and national composition of its population.
What is relatively new in many countries, however,
is the commitment to widening educational and
economic opportunities for all citizens. As a result of
this shift in social and economic expectations, the
use of rigorous academic tests as gatekeepers has
come under fire in many countries. In France, for
example, the expansion of options under the Bac
emerged from the struggle of the 1960s to reform not
only the schools but much else in French society.

In discussions with many educators and poli-
cymakers from European countries, OTA found a
fairly common and growing concern with the equity
implications of educational testing; European (and 
to a lesser extent Asian) education policymakers are 
in fact looking to the United States for lessons about 
how to design and administer tests fairly. Although 
the ultimate resolution of complex equity issues 
escapes predictability, there is no doubt that contin- 
ued cross-cultural and transnational exchanges among 
policymakers and educators grappling with these 
issues will be invaluable. 

System Monitoring 

European and Asian nations tend not to use 
student examinations to gauge the performance of 
their school systems. That function is still handled 
primarily by inspections carried out at the ministe-
rial or provincial government levels. There has been
heightened interest in using the results of interna-
tional comparative test score data for policymaking,
although exactly how to use the data for internal
policy analysis is a relatively new question.^33
Nevertheless, three lessons for the United States 
emerge from the European and Asian experiences. 



First, other countries considering the adoption of 
some kind of test-based accountability system tend 
to view the American National Assessment of 
Educational Progress (NAEP) as a model. The fact
that NAEP uses a sampling methodology, addresses
a relatively wide range of skills, and is a relatively
"low-stakes" test makes it appealing as a potential
complement to other data on schools and school
systems. One lesson for American policymakers,
therefore, is to approach changes to NAEP cau-
tiously (see also ch. 1 for a thorough discussion of
NAEP policy options).

The second lesson is to consider nontest indica- 
tors of educational progress that could be valid for 
monitoring the quality of schools. In this regard, 
careful study of the ways in which inspectors operate
in other countries (how they collect data, what kind
of data they collect, how their information is trans-
mitted, how they maintain neutrality and credibility)
could be fruitful.^34

Finally, the European and Asian approach to 
system monitoring suggests a general caution re- 
gardless of whether tests, inspections, or other data 
are utilized. Public perception of the adequacy of
schools in most countries depends on which schools
are in question: parents typically like what their own
children are doing, but complain about the system as
a whole. It is difficult to pinpoint the causes of this
dual set of attitudes;^35 in any event, it is fairly clear
that there is greater enthusiasm for reform in general
than for changes that might affect one's own children.
Like the "not-in-my-backyard" ("NIMBY") prob-
lem faced by environmental policymakers, edu-
cation policymakers in many countries face a
formidable "NIMSY" problem: education reform
may be OK, so long as it is "not-in-my-school
yard." American, European, and Asian educators
and policymakers who have struggled with the
NIMSY problem in their attempt to respond effec-
tively to analyses of various types of system
monitoring data could learn much from one another.



32 See Klnt, op. cit., footnote 4, for discussion of the quality of U.S. colleges and universities.

33 The Organisation for Economic Cooperation and Development (OECD) has been sponsoring, along with the U.S. Department of Education, an
ongoing collaborative effort to better understand and utilize comparative data on student achievement.

34 For discussion of multiple indicators of education, see U.S. Department of Education, National Center for Education Statistics, Education Counts:
An Indicator System to Monitor the Nation's Educational Health (Washington, DC: 1991).

35 One explanation that caused a stir in policy circles was the finding that statewide achievement scores in every State were above the national average.
See discussion in ch. 2 of this report.
Test Format

In European countries, the dominant form of
examination is "essay on demand." These are
examinations that require students to write essays of 
varying lengths. Use of multiple-choice examina- 
tions is limited, except in Japan, where multiple- 
choice tests are common at all levels of elementary 
and secondary schooling and are used as extensively 
as in the United States. Performance assessments of 
other kinds (demonstrations, portfolios) may be used 
for internal classroom assessment, but not generally 
for systemwide examinations because of costs. 

The lesson from this mixture of test formats
overseas is a complicated one. On the one hand,
European experience could lead American poli-
cymakers to eliminate, or at least reduce signifi-
cantly, multiple-choice testing; surely some critics
of U.S. testing policy would embrace this position.
But this inference would be erroneous, given the
conflicting evidence from the overseas examples.
For example, if one of the purposes of testing is to
raise standards of academic rigor, the French and
Japanese examples offer conflicting models: both
countries typically rank higher than the United
States in comparisons of high school students'
achievement, but they rely on diametrically different
methods of testing.

If there is a lesson, then, it is that testing in and of 
itself cannot be the principal catalyst for educational 
reform, and that changes in test format do not 
automatically lead to better assessments of student 
achievement, to more appropriate uses of tests, or to 
improvements in academic performance. The fact 
that European countries do almost no multiple- 
choice testing is not, in itself, a reason for the United 
States to stop doing it; rather it is a reason to consider 
whether: a) reliance on the multiple-choice format
satisfies the numerous objectives of testing; and b)
alternative formats in use in other countries,
such as essays and oral examinations, could better 
serve some or all objectives of testing in the United 
States. 

In considering alternative test formats and the 
experience of other countries, it is important to keep
two additional issues in mind. First, as discussed in 
chapters 4 and 8 of this report, the combination of 
multiple-choice and electromechanical scoring tech- 



nologies made the concept of mass testing in the
United States economically feasible. To the extent 
that this type of testing went hand in hand with the 
American commitment to schooling for all, it will be 
interesting to observe whether increased efficiency 
of test format will evolve as an important considera- 
tion in European countries committed to expansion 
of school opportunities for the masses. 

Second, one of the important advantages of the 
multiple-choice format is that tests based on many 
different questions are usually more reliable and 
generalizable than tests based on only a few ques-
tions or tasks.^36 It allows for statistical analysis of
test reliability and validity both before and after tests 
are administered. In addition, multiple-choice tests 
allow for statistical analysis of items and student 
responses, not as easily accomplished with perform- 
ance assessments. If criteria such as reliability and 
validity remain a central concern among American 
educators, the adoption of European testing methods 
will necessitate substantial investments in research 
and development to bring those methods up to 
acceptable reliability and validity standards. 
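
The link between the number of questions and reliability can be illustrated with the Spearman-Brown formula from classical test theory. The report does not cite this formula; it is brought in here only as a standard textbook illustration. Under its assumption that added items are parallel to the existing ones, lengthening a test by a factor k changes its reliability from r to kr / (1 + (k - 1)r). The starting point below (a 5-task test with reliability 0.40) is a made-up value for illustration, not data from the report.

```python
# Sketch of how predicted reliability grows with test length, using the
# Spearman-Brown formula (an illustration; not a method named in the report).


def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor,
    assuming the added items are parallel to the existing ones."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


BASE_ITEMS = 5           # hypothetical short performance assessment
BASE_RELIABILITY = 0.40  # hypothetical reliability of that short test

predicted = {}
for items in (5, 10, 20, 40, 80):
    k = items / BASE_ITEMS
    predicted[items] = spearman_brown(BASE_RELIABILITY, k)
    print(f"{items:3d} items: predicted reliability {predicted[items]:.2f}")
```

The predicted values rise steadily with length but with diminishing returns, which is one way to see why a long multiple-choice test can be more reliable than an examination built on a handful of essay tasks, and why raising the reliability of a short essay examination takes deliberate effort.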

Governance of Testing 

None of the countries studied by OTA has a 
single, centrally prescribed examination that is used 
for all three functions of testing. Moreover, the 
countries of Europe and Asia exhibit considerable 
variation in the degree of centralized control over 
curriculum and testing. In some countries, there are 
centrally prescribed curricula that are used as a basis 
for the standardized examinations students take, 
while elsewhere decisionmaking is more decentral- 
ized. An obvious lesson, then, is that the concept of 
a single national test is no less alien in other 
countries than it has been in the United States. 
Nevertheless, there are important differences in the 
governance of tests between the United States and 
other industrialized countries. 

Testing and Curriculum 

Although most countries allow some local control 
of schooling, in general there is greater national 
agreement over detailed aspects of curriculum than 
there is in the United States. This sense of a shared 
mission is reflected in tests that probe content 
mastery at much deeper levels than most of the 



36 See discussion of generalizability in ch. 6.
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standardized tests in the United States.^"^ As ex- 
plained elsewhere in this report, however, this has 
more to do with the politics of testing than with the 
technology of testing: the United States has a long 
history of decentralized decisionmaking and school 
governance, and an aversion to the idea of curricula 
defined for the Nation as a whole. Standardized tests 
that can be used across the United States have 
therefore been limited to skills and knowledge 
common to most school districts, which has meant
basic reading, writing, and arithmetic.^38 The pursuit
of consensus in the United States for anything 
beyond the basics has proved difficult, though not 
impossible; the best example to date is NAEP, 
considered by most educators who are familiar with 
it as an important complement to the kinds of 
information provided on nationally normed stand- 
ardized tests. Nevertheless, even NAEP items fall 
short of the complexity, depth, and specificity of 
content material attained in written examinations 
overseas. 

Three important lessons regarding governance of 
tests emerge for U.S. policy. First, consensus on the 
goals and standards of schooling appears easier to 
establish in Europe and Asia than in the decentral- 
ized and diverse U.S. education system. As a 
consequence, national examinations in Europe and 
Asia can be very content and syllabus specific. In the 
United States, on the other hand, achieving national 
consensus usually means limiting examinations to 
basic skill areas common to 15,000 school districts. 
Even NAEP, which consists of items derived from 
elaborate consensus-seeking processes, does not 
assess achievement at a level of detail and complex- 
ity comparable to typical essay examinations in 
other countries. The lesson from abroad, then, is that
syllabus-specific tests can be national only in 
countries where curriculum decisions are made 
centrally or where consensus can be easily attained. 

The second lesson, related to the first, concerns
the sequencing of curriculum and test design.
European and Asian experience does not demon-
strate that national testing raises the academic rigor



of curricula, but rather that national consensus on 
goals and standards of schooling allows for consist- 
ent curricula that can be tested by syllabus-based 
national examinations. Indeed, the importance of 
keeping the horse of curriculum and instruction 
before the cart of assessment (one of OTA's central 
findings in this report) is reinforced by the overseas 
experience. 

The third lesson concerns the effects of heavily
content-driven examinations on student behavior. 
Syllabi, topics, criteria of excellence, and questions 
from prior examinations are widely publicized in 
other countries, where preparing for tests is encour- 
aged. This emphasis on curricular content conveys 
an important signal to students in Europe and Asia: 
"study hard and you can succeed." In the United
States, students are encouraged to work hard, but 
their success in gaining admission to college or in 
finding good jobs often depends on many other 
factors besides their performance on tests closely 
tied to academic courses they have taken. While 
there is clearly a need for tests that can assess fairly 
the differences in knowledge and skills of individu- 
als from vastly diverse and locally controlled school 
environments,^39 there may also be considerable
merit in the use of examinations that reinforce the
value of studying material deemed worthy of learn-
ing.^40

The Private Sector 

Only in the United States is there a strong 
commercial test development and publishing mar- 
ket. The importance of this sector, in terms of 
research, development, and influence on the quality 
and quantity of testing, cannot be overstated. Even 
when States and districts create their own tests, they 
often contract with private companies. In Europe
and Asia, testing policies reside in ministries of
education. 

There is a certain paradox about the preference for 
public administration of tests in other countries and 
private markets in this country. Given that European 
and Asian countries typically have less trouble than 



^37 See, e.g., National Endowment for the Humanities, National Tests: What Other Countries Expect Their Students to Know (Washington, DC: 1991),
for examples of test questions faced by students in Europe and Japan.

^38 For discussion of how multiple-choice items can assess certain "higher order thinking skills," see ch. 6.

^39 See Donald Stewart, "Thinking the Unthinkable: Standardized Testing and the Future of American Education," speech before the Columbus
Metropolitan Club, Columbus, OH, Feb. 22, 1989.

^40 This issue turns on distinctions between aptitude testing and achievement testing (see ch. 6). For discussion of the historical development of these
approaches to testing, see ch. 4. See also James Fallows, More Like Us (Boston, MA: Houghton-Mifflin, 1989), pp. 152-173.
the United States in defining national goals and
standards of education, the ability to specify testing
needs and contract with private vendors for test
development and production ought to be relatively
easier in other countries than in the United States. On
the other hand, given that fragmentation in curricular
standards and educational goals in the United States
raises formidable barriers to market transactions, one
might expect greater reliance on nonprofit or gov-
ernmental organization of testing.

The Role of Teachers 

Considerable responsibility is vested in teachers 
in other countries for the administration and scoring 
of standardized examinations. This practice is based
on the premise that examinations with heavy empha- 
sis on academic content should be developed and 
graded by professionals charged with delivering that 
content and respected for their ability to ascertain 
whether children are learning it. The important 
lesson for U.S. testing policy, then, is that faith in the 
professional caliber of teachers is a necessary 
condition for a credible system of examinations that 
requires teachers' judgments in scoring.

It is important to note that many European 
countries have only one or very few teacher training 
institutes, guaranteeing more consensus on the 
principles of pedagogy and assessment than in the 
United States, where teacher education occurs in
thousands of colleges and universities. The central- 
ized model of teacher training in other countries 
reinforces the professional quality of teaching, and 
makes it relatively easier to implement national 
curricula. The American tradition emphasizes stand- 
ardized testing as a source of information to check 
teachers' judgments and to assure that children in 
diverse schools and regions are being treated equita- 
bly. The lesson from the European model, then, is 
that a centralized system of teacher preparation can 
increase the homogeneity of teaching and curricu- 



lum and reduce the need for assessments designed to 
assure that all children are receiving similar educa- 
tional experiences. This suggests a familiar theme: 
changing testing will not necessarily improve teach-
ing, but changes in teaching can lead to different
approaches to testing.

U.S. policymakers wishing to adopt examinations 
on European or Asian models will need to balance 
the need for increased reliance on teacher judgments 
with public demand for a system that provides an 
"independent second opinion," especially when
test results have high stakes. 

Snapshots of Testing in Selected
Countries^41

The People's Republic of China 

The first examinations were attributed to the
Sui emperors (589-618 A.D.) in China. With its
flexible writing system and extensive body of re-
corded knowledge, China was in a position much
earlier than the West to develop written exami-
nations. The examinations were built around candi-
dates' ability to memorize, comprehend, and inter-
pret classical texts.^42 Aspirants prepared for the
examinations on their own in private schools run by
scholars or through private tutorials. Some took
examinations as early as age 15, while others
continued their studies into their thirties. After
passing a regional examination, successful appli-
cants traveled to the capital city to take a 3-day
examination, with answers evaluated by a special
examining board appointed by the Emperor. Each
time the examination was offered, a fixed number of



^41 In the following country profiles all data on area and total population come from Mark S. Hoffman (ed.), The World Almanac and Book of Facts,
1991 (New York, NY: Pharos Books, 1990); age of compulsory schooling and total school enrollment figures come from the United Nations Educational,
Scientific, and Cultural Organization (Unesco), Statistical Yearbook (Louvain, Belgium: 1985 and 1989). School enrollment figures include "pre-first
level," "first level," and "second level" students. Data on number of school days comes from Kenneth Redd and Wayne Riddle, Congressional
Research Service, "Comparative Education: Statistics on Education in the United States and Selected Foreign Nations," 88-764 EPW, Nov. 14, 1988.

For comparison purposes, current U.S. data are: size, 3.6 million square miles; population, 247 J million. Mark S. Hoffman (ed.), The World Almanac
and Book of Facts, 1990 (New York, NY: Pharos Books, 1989). School enrollment: 46.0 million. U.S. Department of Education, National Center for
Education Statistics, The Condition of Education, 1991, vol. 1, Elementary and Secondary Education (Washington, DC: U.S. Government Printing
Office, 1991).

^42 Stephen P. Heyneman and Ingemar Fagerlind, "Introduction," in The World Bank, University Examinations and Standardized Testing
(Washington, DC: 1988), p. 3.
Size                       3,705,390 square miles,
                           slightly larger than the United States
Population                 1,130,065,000 (1990)
School enrollment          177.8 million (1988)
Age of compulsory
  schooling                6 to 16
Number of school days      September 1 to mid-July;
                           exact number of days not available
Selection points and
  major examinations       1. Provincial examinations at end of
                              9th year of compulsory schooling
                           2. Central examinations set by the
                              State for university and college
                              entrance
Curriculum control         National, central control

aspirants were accepted into the imperial bureauc-
racy.^43

Education in China today is largely centrally
controlled. Curricula and the examinations that
accompany them are used as a reflection of political
philosophy and as a means of maintaining cultural
cohesion, as well as to reinforce common loyalties
in a population of over 1 billion people, speaking
several major languages, distributed over a huge
land mass (larger than the United States). There
remains a sharp separation between academic school-
ing and vocational schooling, and examinations are
the basis for making these selections at the end of the
9 years of compulsory schooling. Students may then
enter general academic schools, vocational or tech-
nical schools, or "key schools," which accept the
top cadre of students and receive superior resources
in part based on the test results of their students. The
examinations at this level are prepared by provincial
education bureaus and are administered on a city-
wide basis.

At the end of upper secondary school, students 
seeking university entrance take a centralized exam- 
ination that provides no choice of subjects, speciali- 
zations, or options. This examination is developed 
by the National State Education Commission and 
administered by provincial higher education bureaus
who assign candidates to schools based on scores, 
specialties, and places available. The same is true for 



technical schools. The Central Ministry of Labor and 
Personnel develops and administers a nationwide 
entrance examination for skilled worker schools. 
Strict quotas are assigned for overall opportunities 
for further study and to particular programs at 
specific institutions, based on a master plan of 
national and regional development goals. The size,
wealth, and general power of certain municipalities 
(Beijing, Shanghai, and Tientsin) have enabled them 
to assume control over the examination mechanism, 
which in other locations may be directed by the 
central or provincial authority. 

The number of candidates for university entrance
is huge: in 1988, 2.7 million students prepared for
the national college admission test. Less than
one-quarter were accepted for study. Overall, about
2 percent of Chinese first graders eventually go on
to higher education.^44 The format of the examina-
tions, once extended answer/essay format, is begin-
ning to change to short-answer and multiple-choice
questions. Nevertheless, examinations are still 
scored by hand rather than machines. Some analysts 
suggest that, given the huge numbers of examinees, 
it is only a matter of time before machine-scorable 
formats are introduced, reinforcing the already 
strong emphasis in Chinese schools on rote learning 
and recall of facts.^45

The pendulum of Chinese higher education ad-
mission policy has swung with political pressures.
After 1,000 years and a well-established tradition of
using examinations to control admission to higher
education and further training, the Chinese abol-
ished examinations during the Cultural Revolution,
with the goal of eliminating status distinctions.
Selection was to be based instead on political
activism and "correctness" of social origin. The
pendulum swung back again with the new regime in
1976, when examinations were reestablished as a
means of allocating university places on the basis of
merit. Student scores rather than political orthodoxy
have again become the major criterion for advance-
ment. Examinations confer status in China. It is not
uncommon to inquire about a person's status in



^43 William K. Cummings, "Evaluation and Examination," International Comparative Education Practices: Issues and Prospects, Thomas Murray
(ed.) (Oxford, England: Pergamon Press, 1990), p. 90.

^44 Harold J. Noah and Max A. Eckstein, "Tradeoffs in Examination Policies: An International Comparative Perspective," Oxford Review of Education,
vol. 15, No. 1, 1989, p. 22.

^45 Ibid.
society by asking: "How many examinations has he
(or she) passed?"^46

The Union of Soviet Socialist Republics
(U.S.S.R.)^47

Soviet society has been characterized by central
control and planning, and this centralization extends
to the educational system.^48 The 15 republics
and subrepublics that made up the U.S.S.R.
had shared a central curriculum and common
school organization. Considerable local discretion
had been provided, however, in education policy as
it pertained to the secondary school-leaving certifi-
cate, the attestat zrelosti (maturity certificate). This
certificate was based on accumulated course grades
and an examination that was predominantly oral in
nature. Each of the 15 republics was responsible for
setting the content and standards of the examination,
and the teachers who prepared the students domi-
nated the process of setting the questions and
evaluating the responses.^49

Because there was so little comparability in 
grading, the value of the attestat zrelosti meant
different things in different parts of the country. As
a result of this variability, the VUZy (universities 
and technical institutes) developed their own en- 
trance examinations. Much like in the Japanese 
system, each university set its own questions, testing 
schedule and policy, cutoff score, and grading 
procedure. This diversified system placed a burden 
on students, who needed to negotiate a web of 
uncoordinated examinations, and travel great dis- 
tances to sit for the necessary examinations at the 
university or institute of their choice. Much of the 
examination process involved oral examinations. 
The system was described as erratic, inconsistent, 
confusing, and subject to influence peddling and 



Size                       8,649,496 square miles, the
                           largest country in the world,
                           approximately 2.5 times the
                           size of the United States
Population                 290,939,000 (1990)
School enrollment          4.9 million (1988)
Age of compulsory
  schooling                7 to 17
Number of school days      September 1 to May 30;
                           exact number of days not available
Selection points and
  major examinations       1. Secondary school-leaving
                              examinations set by each
                              republic, graded by local
                              teachers
                           2. Each university and
                              technical institute sets its
                              own entrance examination
Curriculum control         National, central control

corruption. There were persistent reports of discrim-
ination against ethnic and religious groups in the
examination process.^50

Controlling the flow of students into the univer-
sity system was part of the overall regional and
national planning that had been carried out through
test quotas. During the revolution of 1917, univer-
sity entrance examinations were abolished, and
access was opened to all students. However, the
examinations were reinstated in 1923.^51 The more
recent balance between central planning and local
flexibility was another example of the need for
political compromise. Some maintained that the
tradeoff for local flexibility had been an incoherent
and inconsistent system. In part to find more
objective and standardized forms of testing, Soviets
had begun looking to "American tests," machine-
scorable multiple-choice tests, for possible use in the
attestat zrelosti. It is not clear how the various
republics will react to relinquishing some of their
local discretion in developing and scoring tests. As
noted above, it is yet to be seen how the independ-
ence of the Soviet republics will affect the examina-
tion systems that were developed to serve the
centralized political system of the past.




^46 Eckstein and Noah, op. cit., footnote 6, p. 308.

^47 This snapshot refers to the period before the recent breakup of the U.S.S.R. into separate republics.

^48 Education and examination processes are undergoing radical changes and it is too soon to draw final conclusions. V. Nebyvaev, third secretary,
Embassy of the Union of Soviet Socialist Republics, personal communication, July 31, 1991.
^49 Noah and Eckstein, op. cit., footnote 44, p. 23.
^50 Ibid.
^51 Ibid.
Japan

When the United 
States compares itself to 
Japan, it is common to 
bemoan the fact that our 
schools are not more like 
theirs. Interestingly, one 
of the few things the two 
education systems have 
in common is their reliance on machine-scorable
multiple-choice examinations. In other ways our
cultures and traditions are so different that many
comparisons are superficial and, in some cases,
potentially destructive.^52

When Japan emerged from its feudal period in the
mid-19th century, it began to look to the West for
models to modernize aspects of Japanese life.^53
Among these models were the Western goals of
compulsory primary education and of a high-quality
university system. Japan also followed the French
example of a centrally prescribed curriculum and
textbooks, frequent testing during a school year, and
end-of-year final tests. However, since Japanese
students often finished the prescribed curriculum
before the end of the school year, they began to focus
on the use of entrance examinations for the higher
level, rather than school-leaving examinations from
the lower level. These entrance examinations be-
came valued for several reasons. The first and most
obvious was the need to select a few students from
the many seeking higher levels of education. An-
other reason for devotion to examinations came from
the uniquely Japanese cultural disposition known as
ie psychology, ". . . the tendency to rigorously
evaluate individuals before permitting them to join
a family system or a corporate residential group, but
once they are admitted, to accept and adjust to them
as full members."^54 This concept of first passing
rigorous scrutiny and then receiving what becomes
lifetime acceptance into established groups can be
seen in acceptance of spouses into a family unit or
employees into membership in Japanese firms.^55



Size                       145,856 square miles, slightly
                           smaller than California
Population                 123,778,000 (1990)
School enrollment          21.2 million (1988)
Age of compulsory
  schooling                6 to 15
Number of school days      243
Selection points and
  major examinations       1. Examinations for entry to
                              some junior high and high
                              schools
                           2. Joint First Stage
                              Achievement Test: national
                              preliminary qualifying
                              examination for national and local
                              public universities
                              (approximately 49 percent
                              of all university candidates);
                              abolished in 1989 and
                              replaced with Test of the
                              National Center for
                              University Entrance
                              Examinations for public
                              universities (and some
                              private universities)
                           3. Each university sets own
                              College Entrance Examina-
                              tions
Curriculum control         National, central control



The second major reform in Japanese schooling 
was implemented by the American occupation 
following World War II.^56 The School Education
Law of 1947 caused a massive reorganization of the 
existing school facilities that is the basis for today's 
educational system. Among these reforms were the 
establishment of a 6-year compulsory primary 
school and 3 additional years of a compulsory 
middle or lower secondary school. The first 9 years 
of compulsory education are free to all students. An 
additional 3 years of high school are modeled on the 
lines of the American comprehensive high school; 
however, all high schools charge tuition. While the 
law said that ". . . co-education shall be recognized
in education," many private junior high or high
schools and some national and public local high
schools are for one gender.^57

Higher education also was to be reformed, with 
the aim of broadening goals, leveling the traditional 



^52 See, e.g., Fallows, op. cit., footnote 40.

^53 While the education system imported the "practical" disciplines (mathematics, science, and engineering) from the West, its moral content was
strictly Japanese. The 1890 Imperial Rescript on Education made "the teachings of the ancestors of the Imperial Family" the basis for all instruction.
"Education Reform in Japan: Will the Third Time be the Charm?" Japan Economic Institute Report, No. 45A, Nov. 30, 1990, p. 2.

^54 William K. Cummings, "Japan," in Murray (ed.), op. cit., footnote 43, p. 131.

"Education Reform in Japan," op. cit., footnote 53.
^57 Article 5 of the Fundamental Law of Education, Horie, op. cit., footnote 10.
hierarchy, expanding opportunities, and decentraliz-
ing control. While many of the reforms envisioned
for changing higher education were not long-lived,
opportunities were vastly expanded, and important
powers devolved to universities, e.g., power over
academic appointments, admissions, and so on. The
postwar constitution formally guarantees academic
freedom, and university autonomy is held sacred.
Nevertheless, the government controls the purse
strings for national universities, and ties between
large employers and the national universities have
led to a perpetuation of the hierarchy in Japanese
education.^58

Japanese education today is highly centralized, 
with a common curriculum and little choice in
subjects. Test scores become important early and
throughout the structured progression of students
along a carefully defined path. Some suggest this has
had the impact of transforming Japan from an
aristocracy to a society where what counts is the
university one attends.^59 There is a progression,
based on examinations, that has provoked consider- 
able competition among students and their parents. 
While primary schools are quite egalitarian, many 
students compete for the more elite national junior 
high schools that grant entrance based on test scores 
and, in some cases, a lottery. There are also many 
private junior high schools whose entrance examina- 
tions are very competitive. It is hoped that success in 
an elite junior high will help guarantee entrance to 
the best high schools. There is space for approxi- 
mately 60 percent of all the students in public high 
schools; private schools receive the rest.^60

Since there is now room for all students to attend 
high school of some sort, and since the curriculum 
is centralized, based on the university entrance 
examinations, today there is somewhat less competi-
tion for high school entry than in the past. But those
high schools (public and private) with larger num- 
bers of successful university applicants are still 
prized. Student selection to high school is based on 
prior grades and teacher recommendations as well as 
the high school entrance examination. With recent 



education reforms, some of the pressure of this first 
stage of Japan's examination system has been 
reduced. 

While the entrance examination system for Japa- 
nese universities has been in existence for over a 
century, the pendulum of common examinations v. 
university-developed examinations has swung back 
and forth. In the prewar period, an entrance examina- 
tion was used only for those prestigious national 
universities that attracted large numbers of appli-
cants. The private institutions did not require these
examinations. With the postwar educational re-
forms, a single common examination, the Japanese
National Scholastic Aptitude Test, was instituted for
all universities. This examination was abolished in 
1954 and replaced by a system whereby each 
university conducted its own entrance examination. 
School grades and recommendations from high 
school teachers were not given much weight, and 
eventually educators became concerned that the 
university entrance examinations did not adequately 
cover the scholastic ability of applicants.^61

In 1979, therefore, a new system was put into
place that eventually led to today's two-tiered
examination system. The first stage required all
applicants to national and local public universities
(currently approximately 49 percent of all 4-year
college applicants^62) to take the Joint First Stage
Achievement Test (JFSAT), a retrospective exami-
nation created by the Ministry of Education. This
examination was offered once a year to test mastery
of the five major subjects in secondary school
curriculum. In 1990, the JFSAT was abolished and
replaced by the Test of the National Center for Univer-
sity Entrance Examinations (TNCUEE). The main
differences between these two tests are in use and
content. The JFSAT was required of applicants to
national and local public universities only, while the
TNCUEE is taken by some applicants for some
private universities as well. In addition, the TNCUEE
requires applicants to take examinations only in
those subjects required by the universities to which



^58 William Cummings, Harvard University, personal communication, August 1991.

^59 In the United States and Korea, having the credential or degree is what counts in terms of prestige and career possibilities. In Japan, though, the status
stems from attending a university: it is more important to be "Todai Man" (to attend Tokyo University) than to earn a Ph.D. James Fallows, personal
communication, July 18, 1991.

^60 Cummings, op. cit., footnote 58.

^61 Ikuo Amano, "Educational Crisis in Japan," in Cummings et al. (eds.), op. cit., footnote 7, pp. 38-39.
^62 Horie, op. cit., footnote 10.
they are applying.^63 The second tier of examinations
is the College Entrance Examinations (CEE), individ-
ually developed, administered, and graded by the
faculties of each of the prestigious and highly
selective universities.

While 34 percent of high school graduates seek
university entrance, only 58 percent of these appli-
cants gain entrance.^64 One-third^65 of the applicants
each year are ronin, "masterless samurai," who are
repeating the examinations after attending special
prep schools (yobiko) and juku (tutorial, enrichment,
preparatory, and cram schools) in order to get higher
scores, qualifying them for admission into the
prestigious universities.

In fact, the juku, or cram school, and the yobiko 
have become almost a parallel school system to the 
public schools. The sole curriculum of these after- 
hours or additional schools is examination prepara- 
tion. There are 36,000 juku in Japan. It is a $5-billion 
a year industry. More than 16 percent of the primary 
school children and 45 percent of junior high 
students attend juku,^66 even though the extra school-
ing costs several hundred dollars a month and
represents a significant financial burden for many
families.^67 In fact, with competition even to gain
entry into some of the most successful cram schools,
some of which give their own admission tests, there
are jokes about going to juku for juku. 

There has been a great deal of concern in Japan
about the impacts of "exam hell" in two regards:
the impact on students and the impact on curriculum.
In Japan, high school is not the time of exploration
and discovery, socialization and extracurricular
activities, football games and dating that is found in
the American high school. Instead, students spend
almost every waking hour in school, in juku, or at
home studying. The school day is long and after
school children go to juku; the school week extends
through Saturday morning, and the school year is
approximately 240 days long. Pressure is great and
continuous until a student makes the final cut:
entrance into a prestigious university. One popular
saying is: "Sleep four hours, pass; sleep five hours,
fail."^68

Other impacts are more subtle, but of equal 
concern: students who memorize answers but cannot 
create ideas, and a curriculum that focuses every- 
thing on preparation for the examinations. When 
students view schooling as ". . . truly relevant when
it promotes preparation for the CEE and as only
marginally useful when it does not contribute
directly to university admission,"^69 this has a major
cognitive and motivational impact on students'
approaches to education. It is not clear whether a
love of learning for learning's sake can be inspired
later, once the student jumps the final hurdle and
makes it to the home stretch of the university.
Indeed, once accepted into college, students can take
it easy and relax, discover the joys of the opposite
sex and perhaps begin to rediscover some of the
pleasures forsaken in their "lost childhoods." In
fact, the college period in Japan has often been
referred to as a "4-year vacation," although a well
earned one, since the average Japanese student ranks
at the top of the list in mathematics, science, and a
number of other subjects in international compari-
sons.^70

France 

The locus of control for education in France
is the Ministry of Education (MOE). The curric-
ulum, topics for examinations, and guidelines
are set by MOE, with examination questions
and overall administration coordinated by the
32 regionally dispersed academies. The Minister of
Education sets a general program of what should be 
examined, but each academy is responsible for 



^63 Ibid.
^64 Ibid.
^65 Ibid.

^66 Carol Simons, "They Get by With a Lot of Help From Their Kyoiku Mamas," Smithsonian, vol. 17, March 1987, p. 49.
^67 Fallows, op. cit., footnote 59.
^68 Simons, op. cit., footnote 66, p. 51.

^69 Nobuo Shimahara, "The College Entrance Examination Policy Issues in Japan," Qualitative Studies in Education, vol. 1, No. 1, 1988, p. 42.
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Size 220,668 square miles, about 

twice the size of Colorado 

Population 56,184,000 (1990) 

School enrollment 9.6 million (1988) 

Age of compulsory 

schooling 6 to 16 

Number of school days 185 

Selection points and ms^or 

examinations 1. State-controlled brevet at end 

of comprehensive school 
(age 15) 

2. BaoGBUlearet at completion 
of /yoee (age 18), 38 
options, 3 types of dlploma« 
set by each regional 
academy with Ministry of 
Education (MOE) oversight 

3. Admission to selective 
prandeseco/esvia 
c<H)!x>ur8 after 1 to 2 
more years 

Curriculum control National, central MOE control 

administering the curricula and testing within a 
region.71

French students spend 5 years in the ecole
primaire, or primary school, and move to the 
secondary school without taking a graduation or 
selection examination. However, there has been a 
recent interest in examining students to see how well 
the schools are doing. At the beginning of the 1989 
school year, MOE, concerned with reports showing
a large proportion of students (30 percent) with
reading problems on entering secondary school, set
out on an ambitious national examination that could
be compared with the U.S. NAEP.72 Inspectors,
teachers, and specialists from all across France
gathered and created a matrix of national goals and
achievement levels. Teachers submitted ideas for
questions and, after a period of pretesting, the group 
developed a common standardized test for mathe- 
matics, reading, and writing at the third and sixth 
grade levels. All 1.7 million students in these grades 
were tested in their classrooms, and teachers admin- 
istered and scored the tests using coded answer 
sheets. Since the goal was to diagnose individual 
problems, every student was tested and the results 
were sent to parents. Each teacher was given copies 



of the exercises (a mixture of open-ended and 
multiple-choice questions) with discussion of the 
objectives, commentary on kinds of responses stu- 
dents made, and overall scoring results. Although 
summative national results were collected, there was 
to be no classification or comparison made between 
classrooms, schools, and regions. A followup to this
examination was planned for September 1991, using
a sampling of students rather than an every-student
census.73

Democratic reform implemented some 15 years 
ago has meant that almost all 11-year-olds begin
sixth grade in comprehensive secondary schools
(college) of mixed ability levels. At the completion
of comprehensive school, examinations for the 
brevet de college (college certificate) are given in 
three subjects: French, mathematics, and history/ 
geography. The brevet examinations were abolished 
in 1977 and completely replaced by a school-based 
evaluation. However, because of concern with 
declining results and complaints about what it meant
to complete secondary school, the brevets were 
reestablished in 1986. At present, graduation from 
secondary school is based on a combination of 
examinations controlled by the State and an evalua-
tion by the school.74

A common curriculum has been an expression of 
the value placed on the ideal of a unitary, cohesive, 
clearly defined French culture. Some have suggested
this unity was won at the price of official neglect of
minority and regional cultures within the country.75
But this is changing, and nowhere is this change
better reflected than in the discussion of what 
subjects should be taught at the lycee (the third level 
of schooling) and for the Baccalaureat (Bac), taken 
at the completion of the lycee. While once the focus 
was to provide the French culture generale, a 
common French culture through a central curricu- 
lum for the few who could demonstrate a high level 
of formal academic ability in literature, philosophy, 
and mathematics, this attitude has changed dramati- 
cally in recent years. 



71 Henk P.J. Kreeft (ed.), "Issues in Public Examinations," paper prepared for the International Association for Educational Assessment, 16th
International Conference on Issues in Public Examinations, Maastricht, The Netherlands, June 18-22, 1990.

72 Marten Le Guen and Catherine Lacronique, "Evaluation CE-6eme, A Survey Report of Assessment Procedures in France on Mathematics, Reading
and Writing," paper prepared for the International Association for Educational Assessment, 16th International Conference on Issues in Public
Examinations, Maastricht, The Netherlands, June 18-22, 1990.

73 Ibid., p. 4.

74 Kreeft, op. cit., footnote 71, p. 16.
75 Eckstein and Noah, op. cit., footnote 6, p. 312.
Current practice has been moving to reduce the
uniformity and increase variety and options. Since 
1950, the French have changed the Bac radically in 
order to meet demands for a more relevant set of
curricula and to open access to a larger group of 
students. While in the period before 1950 there were
4 options, the Bac has diversified into some 53 
options and 3 types of Bac diploma: secondary 
(general) education diploma, with 8 options; techni- 
cian/vocational Bac, with 20 options; and, since 
1985, a new vocational diploma with 25 options.^^ 
The vocational and technical programs have been 
strengthened and the numbers of students enrolled 
are also rising. 

Indeed, one of the goals of education reform in 
France has been to democratize the Bac. Between 
ages 13 and 15, the proportion of children attending 
schools leading to the Bac drops from 95 to 67 
percent. Among these, one-half actually passed the 
Bac in 1990, i.e., 38.5 percent of students in the 
relevant age group were eligible for admission to 
university. In 1991, 46 percent of the examinees 
passed.78 This represents a dramatic reform to the
French pyramidal system: in 1955, only about 5.5
percent of French students qualified for university-
level education.79 The French Government has set a
goal for the year 2000 to have 80 percent of students
in the age group reach the Bac level.80 Part of this
process is the creation of a number of new techno-
logical, vocational, and professional Bacs, and better
counselling for students concerning specialties,
along with restructuring of the Bac to make all tracks
as prestigious as the "Bac C," the mathematically
oriented track.81

Despite these changes, the Bac remains a revered 
institution in France. It is debated each year as 
questions and model answers are printed in newspa- 
pers after the examinations are given each spring. A 
central core of general education subjects (e.g., 
French literature, philosophy, history, and geogra- 



phy) is required of all candidates, but different 
weights are given in scoring them depending on the 
student's specialization. Examination formats are 
generally composed of four types of questions: the 
dissertation — an examination that consists of a 
question to be answered in the form of an essay; a 
commentary on documents; open-ended questions;
and multiple-choice questions for modern foreign
languages.82 While MOE formulates the various Bac
examinations, working from questions proposed
each year by committees made up of lycee and
university teachers, each academy provides its own 
version from centrally approved lists. Thus ques- 
tions for each subject, though all of the same nature 
and level of difficulty, vary from one region to 
another. Teachers are given some latitude to set their
own standards of grading, and there have been 
concerns regarding a lack of common standards and 
comparability in the various forms of the Bac. 

Today the Bac can no longer be described as a
single nationally comparable examination adminis-
tered to all candidates. While success in the Bac
remains the passport to university study, it has been
suggested that today there is more than one class of
travel in a two-speed university system.83 Thus entry
to the slower track remains automatic with the Bac,
but entry into more remunerative and prestigious
lines of study (classes preparatoires of grandes
ecoles and faculties of medicine, dentistry, and some
science departments) requires high scores in a more
difficult Bac series. Students who wish to seek
admission to the highly selective grandes ecoles,
which provide superior study conditions and en-
hanced career opportunities for higher ranks of
government service, professions, and business, com-
pete in another examination, the concours, usually
taken after another year or two of intense prepara-
tion. This competition is rigorous; only 10 percent of
the age cohort attends the grandes ecoles.84 Thus,
competition to enter a prestigious university or 



76 Sylvie Auvillain, cultural service, French Embassy, Washington, DC, personal communication, August 1991.

77 Embassy of France, Cultural Service, Organisation of the French Educational System Leading to the French Baccalaureat (Washington, DC:
January 1991).

78 Auvillain, op. cit., footnote 76.

79 Eckstein and Noah, op. cit., footnote 6, p. 304.

80 National Endowment for the Humanities, op. cit., footnote 37, p. 9.

81 Ibid.

82 Kreeft, op. cit., footnote 71, p. 16.
83 Eckstein and Noah, op. cit., footnote 6, p. 309.
84 Ibid., p. 304.
professional track has maintained the high value
placed on examinations in France. 

Germany 

Germany is credited with pioneering the use of
examinations in Europe. In 1748, candidates for the
Prussian civil service were required to take an
examination. Later, as a university education became a
prerequisite for government service, the
Abitur examination was introduced in 1788 as a
means for determining completion of middle school
and consequent eligibility for university entrance.85

Today tracking into one of three lines of schooling
begins at approximately age 10 in Germany. After 
completing 4 years of common schooling (grund- 
schule), German students move into one of three 
lines of schooling. The hauptschule (main school) or 
lower general education extends for 5 years and 
leads to terminal vocational training at about age 16. 
The realschule or higher general education extends
for 6 years and directs students to intermediate
positions in occupations. The gymnasium is the
university track and extends for 9 years. There is also
a gesamtschule: 6 or 9 years of comprehensive
schooling containing all three lines. During each of
these levels of schooling there are relatively few 
examinations until their conclusion. There is a 
reasonable balance in the number of openings for the 
next level for each track, and examination pressure 
is not terribly intense at this level.86 Because of a
traditionally strong and well-respected vocational
track, Germany's dual system means that students
have several options available to them. Ironically,
the traditional distinction between these two career
paths is becoming somewhat blurred and so, by the
same token, is the function of the Abitur. Increasing 
numbers of Abitur holders are turning toward 
apprenticeship or technical training rather than 



Size: 137,743 square miles, slightly smaller than Montana
Population: 77,555,000 (1990)
School enrollment: 11.0 million (1986)
Age of compulsory schooling: 6 to 16
Number of school days: 160 to 170 (varies per State)
Selection points and major examinations:
  1. Tracking at end of common school (age 10) into three lines of schooling,
     but not via examination
  2. Abitur at end of grade 13 for university entrance, determined by each State
     (land), with oversight by national government
Curriculum control: Land control

academic careers, changing the function of the 
examination process.87

At the conclusion of grade 13 in the gymnasium,
students take the Abitur, which entitles them to study 
at their local university or any university in Ger-
many.88 The specific content of each Abitur is
determined by the education ministries in the 
various lander (or States) in Germany, within a 
general framework established by the national Stand- 
ing Conference of Ministers of Education and 
Cultural Affairs. It should be noted that the Abitur, 
like the French Bac, has changed over the years as 
the number of students in gymnasium has increased, 
and greater numbers of Abitur holders have meant
restrictions on their constitutional right to enroll at
a university in a chosen course of study. In 1986,
23.7 percent of the relevant age group held the
Abitur.89

In the past, the Abitur required candidates to 
complete an extraordinarily demanding curriculum, 
but in recent years the breadth and depth of studies 
has been reduced as variety and options have added 
diversification to what was once a relatively uniform 
examination. Demands made on students have been 
subject to swings; in 1979, candidates could take 
selected subjects at lower levels of difficulty, but in 
the fall of 1987 the Council of Ministers reconsid- 
ered these changes and restored some of the older 
regulations and standards, especially limiting candi- 



85 Cummings, op. cit., footnote 43, p. 90.
86 Ibid., p. 92.

87 Eckstein and Noah, op. cit., footnote 6, p. 306.

88 Quite a high number of students do not study at their local university, but at another elsewhere in Germany. Lack of places at the local university
means that some students have to study at distant universities. Reinhard Wiemer, second secretary, German Embassy, Washington, DC, personal
communication, August 1991.

89 National Endowment for the Humanities, op. cit., footnote 37, p. 29.
dates' freedom to select subjects at lower levels of
difficulty. Students choose four subjects in which to
be examined, across the categories of knowledge:
languages, literature, and the arts; social science; 
and mathematics, natural sciences, and technology. 
Examinations are strongly school-bound, with much 
effort placed on tying questions to the training 
provided by a particular school. Even if questions 
are provided centrally across a land, different sets
are provided from which teachers may choose. In
virtually all lander, the assessment of the examina-
tion papers takes place entirely within the school, by
the students' own teachers. Only Baden-Wurttemberg
has a system of coassessment by teachers of other
schools.90

Examinations always consist of open-ended ques- 
tions, which usually require essay responses. Some 
examinations are oral, while others, in subjects such 
as art, music, and natural sciences, may involve 
performance or demonstration.91

Despite the open format of the Abitur, there has 
been more concern with comparability across the 
various lander than across individuals, since school- 
ing is a land prerogative. There is a delicate balance 
between State ownership of examinations and na-
tional comparability. As a result, some lander regard
Abitur earned in other lander with a certain degree
of suspicion, limiting student ease of movement to
universities across the country and comparability
and transferability of credentials.92

Sweden 

Swedish schooling has always been characterized by a
blend of central control of curriculum and decentralized
management and assessment. In seeking to offer
equivalent education to all students, regardless of
social background or geographic
location, there has been a national curriculum




Size: 173,731 square miles, slightly larger than California
Population: 8,407,000 (1990)
School enrollment: 1.2 million (1987)
Age of compulsory schooling: 7 to 16
Number of school days: 180
Selection points and major examinations:
  1. After compulsory school (age 16) admission to upper secondary school
     (gymnasieskolan) by marks, not examinations.
  2. University entrance by grades or the Swedish Scholastic Aptitude Test
     (national tests).
Curriculum control: National, common curriculum with local flexibility

accompanied by detailed earmarking of grants to
municipal authorities for the organization and ad- 
ministration of schools. Recent reforms have speci- 
fied that the national government will indicate goals 
and guidelines, while municipalities are responsible 
for the achievement of targets set by the national 
education authority. Each municipality will receive
financial support from the national authority, but
without detailed spending regulations.93

Compulsory schooling for Swedish children be- 
gins at age 7 and extends through grade nine, to age 
16. The elementary school (Grundskol) is divided 
into three levels: lower (1 to 3); middle (4 to 6); and 
upper (7 to 9). Students remain in common heteroge- 
neous classes throughout the first 9 years, but at the 
upper school level (grades 7 to 9) they begin to 
choose from a number of elective courses. There is 
a common curriculum for all schools at each level; 
those studying any given subject at the same level 
follow the same curriculum, have the same number 
of weekly periods, and use common texts and
materials. However, it is understood that within the 
general framework it is up to the teacher to develop 
his or her own approach to teaching the subject.94

After finishing compulsory schooling at age 16, 
the great majority of students continue on to the 
integrated upper secondary school or gymnasieskolan. 
At the upper secondary school, there are a variety of 



90 Kreeft, op. cit., footnote 71, p. 18.

91 National Endowment for the Humanities, op. cit., footnote 37, p. 29.
92 Eckstein and Noah, op. cit., footnote 6, p. 314.

93 As of July 1, 1991, the National Board of Education and regional county education committees were abolished and a new central education authority
was established. Karin Rydberg, A Redistribution of Responsibilities in the Swedish School System (Stockholm, Sweden: The Swedish National Board
of Education, January 1991).

94 National Swedish Board of Education, "Assessment in Swedish Schools," informational document, February 1985, p. 1.
courses of study in 2-, 3-, and 4-year programs.
Overall some 25 options or lines of study are 
available, each characterized by a combination of 
special subjects and a common core of compulsory 
subjects.95 Admission to the integrated upper sec-
ondary school is based on teacher grades (referred to
as marks) obtained in elementary school, with a 
certain minimum average required. All subjects 
(including music, drawing, and handicraft) are 
included in computing the marks, with none weighted 
more heavily than any other. In 1983, approximately 
85 percent of the age cohort were admitted to the 
gymnasieskolan, with 10 percent applying and not 
admitted, and about 5 percent not applying to upper 
level schooling.96

Assessment in Swedish schools consists of both
marks and standardized tests (centralaprov). The
individual teacher is solely responsible for the
marking, and no educational or legal authority can
alter a given mark or force a teacher to do so. Marks 
are given at the end of each course as a means of 
providing information to the students and parents on 
the student's level of success in a course, and are the 
basis of selection of students for admission to the 
upper secondary school and to the university. Thus 
there is considerable effort to provide assurance that 
marks have the same value, despite the fact that 
marks are given by thousands of individual teachers
across the country. 

The main purpose of standardized achievement 
testing in Sweden is to enable the teachers to 
compare the performance of their own class with that
of the total population and adjust their marking 
scale. While the centralaprov are developed by the 
national education authority, the tests correspond 
closely to the syllabi and are aimed at measuring 
achievement based on national standards. All stand- 
ardized tests, which are short answer, fill in the 
blank, and short essay examinations, are centrally 
developed but administered and graded by the 
classroom teacher. Detailed instructions on scoring 
principles are issued by the national board. A sample 
of results representative of the total population of 
students tested is submitted to the national board, 
and marking norms are developed so that test results 
can be converted into one of the marks on the 5-point
Swedish scale. These norms are then sent to all 



schools, and teachers mark their tests accord-
ingly.
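
The conversion step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the report does not say how the national board derives its marking norms, so the percentile shares below are hypothetical placeholders, and the function names build_norms and score_to_mark are invented for this example.

```python
from bisect import bisect_right

# Hypothetical shares of the national sample assigned to marks 1 through 5.
# The report does not give these figures; they are placeholders only.
ASSUMED_SHARES = [0.07, 0.24, 0.38, 0.24, 0.07]


def build_norms(sample_scores, shares=ASSUMED_SHARES):
    """Return the four cut scores that separate marks 1..5.

    sample_scores stands in for the representative sample of raw test
    results submitted to the national board.
    """
    ordered = sorted(sample_scores)
    cuts = []
    cumulative = 0.0
    for share in shares[:-1]:
        cumulative += share
        # Position in the sorted sample where this mark band ends.
        index = min(int(round(cumulative * len(ordered))), len(ordered) - 1)
        cuts.append(ordered[index])
    return cuts


def score_to_mark(raw_score, cuts):
    """Convert a raw test score to a mark from 1 (lowest) to 5 (highest)."""
    return bisect_right(cuts, raw_score) + 1


if __name__ == "__main__":
    national_sample = list(range(100))  # made-up raw scores
    norms = build_norms(national_sample)
    for raw in (3, 40, 95):
        print(raw, score_to_mark(raw, norms))
```

In this sketch the norms are just a short list of cut scores, which mirrors the idea that the board sends one set of norms to every school and each teacher applies it locally.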

Although some tests are used for diagnosis at the 
classroom level, neither these nor centralaprov are 
used for selection or school accountability in the 
sense of ranking schools. A large number of
standardized tests measuring skills and knowledge 
are used, along with diagnostic materials. Achieve-
ment testing is not conducted until grade eight (in
English) and grade nine in Swedish and mathemat-
ics. All standardized tests at the elementary level are
voluntary for the school and/or teacher; however,
about 80 percent of all teachers use them. These tests
are used repeatedly over a period of some years and
are kept confidential.97

In the upper secondary school, the standardized 
achievement tests must be given in each subject. 
These, too, have been developed by the national 
board and are scored by teachers. 

Final assessment of each student at the end of a 
term is a carefully orchestrated business. Teachers
keep records of each student's performance on 
compulsory written tests (in addition to the standard- 
ized tests); these are filed and made available when 
the inspectors from the county education commit-
tees visit schools. On these visits, they check to see
if the marking principles applied by the teacher are 
more lenient or severe than national norms. At the 
end of a term, the teacher surveys all evaluation data 
collected above (written tests, standardized tests, 
and observations based on running records) and 
ranks the pupils in the class from top to bottom on 
the same 5-point scale. 

Here again the standardized tests play an impor- 
tant role. First the teacher calculates the mean of the 
preliminary marks and records their distribution 
over the 5-point scale, then compares these data with 
the mean and distribution of marks obtained by the 
class in taking the standardized tests. These results 
are compared and the teacher adjusts the preliminary 
marks as he or she sees fit, depending on the 
circumstances surrounding the standardized test (the 
class may not have covered some part of the 
standardized test, or there may have been several of 
the best or the weakest pupils missing when the test 
was administered, thus skewing results.) The final 



95 Ibid., pp. 3-5.
96 Ibid., p. 3.
97 Ibid., p. 13.
judgment is the teacher's, although a meeting called
the class conference, attended by the head, assistant 
head, and all teachers teaching the class for one or 
more subjects, is also held. At this meeting, compar-
isons are made between the standard achieved in
different subjects and between the achievements of
different classes in the same subject. "A teacher
who wants to retain noticeable differences between
test results and preliminary marks has to convince
the class conference that there is a valid reason for
doing so."98
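
The comparison a teacher makes between preliminary marks and standardized test marks can be sketched as follows. This is a minimal illustration, not the official procedure: the function names summarize and compare are hypothetical, the sample marks are made up, and the source leaves the actual adjustment to the teacher's judgment, so the code only reports the gaps and does not change any mark.

```python
from collections import Counter
from statistics import mean


def summarize(marks):
    """Return the mean and the count of each mark on the 5-point scale."""
    counts = Counter(marks)
    return mean(marks), [counts.get(m, 0) for m in range(1, 6)]


def compare(preliminary, standardized):
    """Report how a class's preliminary marks differ from its test marks.

    A positive mean_gap means the teacher's preliminary marks are more
    lenient than the marks the class earned on the standardized test.
    """
    pre_mean, pre_dist = summarize(preliminary)
    std_mean, std_dist = summarize(standardized)
    return {
        "mean_gap": pre_mean - std_mean,
        # Difference in the number of pupils at each mark, 1 through 5.
        "distribution_gap": [p - s for p, s in zip(pre_dist, std_dist)],
    }


if __name__ == "__main__":
    preliminary_marks = [3, 4, 4, 5, 2]   # made-up class of five pupils
    standardized_marks = [3, 3, 4, 4, 2]  # same pupils, test-based marks
    print(compare(preliminary_marks, standardized_marks))
```

Keeping the output to gaps only reflects the text: the final judgment stays with the teacher, who must justify any noticeable difference to the class conference.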

Sweden abolished its school-leaving examination 
(for graduation) in the mid-1960s. From that point
on, admission to universities and colleges for 
students coming directly from the upper secondary 
school has been based entirely on the marks given by 
teachers. Applicants 25 years or older and with more 
than 4 years of work experience were admitted based 
on the Swedish Scholastic Aptitude Test (SWESAT).
This test consists of 6 subtests, for a total of 144 
multiple-choice items, with a testing time of approx- 
imately 4 hours. The SWESAT is administered by 
the National Swedish Board of Universities and 
Colleges, with test construction placed in the hands 
of the department of education at Umea University. 
About 10,000 persons take the test each year. The 
selection procedure was part of an elaborate system 
of quota groups to ensure a fair distribution of 
openings for different groups of applicants. There 
are three groups: those submitting formal measures
of academic ability (grades, and SWESAT for those
who have not completed upper secondary education);
those relying on work experience, which for all
groups of applicants may compensate for a low score
on academic ability; and a small number of places
for those accepted for special reasons, despite low
scores.

In the 1970s and 1980s, the number of applicants 
to higher education greatly exceeded the number of 
available places, and this created debate. The 
existing system of quotas was criticized for being 
cumbersome, uniform, and complex. Furthermore, 
the use of work experience was criticized on the 
grounds that it delays the transition to higher 



education. In fact work experience has become 
almost compulsory for many programs in high 
demand. (Today the average age of a first-year
student in Sweden is 23.) The fact that practically
all experience is given credit, regardless of relevance 
to the study program in question, has also been 
debated. Some believe the system should give 
weight largely to academic ability as a better 
predictor of success in higher education. 

As a result of this debate, the Swedish Parliament 
established a new scheme for selection to higher 
education that more strongly stresses the need for 
measures of academic ability and restricts the role of 
work experience. The new system, which went into 
effect in July 1991, uses several factors for determin-
ing admission. Average grades from upper second-
ary school will continue to constitute a major factor 
in the selection process. (Between one-half and 
two-thirds of all students will be selected on the 
basis of grades alone.) A general aptitude test 
(currently the SWESAT) is open to students leaving 
upper secondary school as well. This is seen as an 
alternative path to higher education for those who do 
not have sufficient grades. Between one-third and 
two-thirds of all students will be selected on the 
basis of the test results. Finally, flexibility is being 
added to ensure that a small number of students can 
be admitted on an individual basis.99 It is not yet
clear what impact these changes will have on
school curriculum across Sweden.

England and Wales 

The Education Reform Act (ERA) of 1988 set in motion
a major overhaul of the education system of the United
Kingdom (England, Wales, and Northern Ireland).100
Although authority over the schools had been shifting
from local to central
government at least since the second World War, the
1988 reforms were seen by many as a watershed 
event. One analysis by comparative education re- 




98 Ibid., p. 17.

99 Hans Jansson, "Swedish Admissions Policy on the Road From Uniformity and Central Planning to Flexibility and Local Influence?" paper prepared
for the International Association of Educational Assessment, November 1989. See also Ingemar Wedman, Department of Education, University of Umea,
Sweden, "The Swedish Scholastic Aptitude Test: Development, Use and Research," unpublished document, October 1990.

100 There are actually three education systems in the United Kingdom: one for . . .
report deals predominantly with England and Wales, but all three systems are reforming curriculum and assessment programs.

Size: 94,226 square miles, slightly smaller than Oregon
Population: 57,121,000 (1990)
School enrollment: 10,089,000 (1983)
Age of compulsory schooling: 5 to 16
Number of school days: 192*
Selection points and major examinations:
  1. New national assessments at age 7, 11, 14, 16 (not for selection)
  2. Two-tiered school-leaving examinations: General Certificate of Secondary
     Education at age 16 or earlier; "A levels" at grades 11 or 12 (sixth form)
     at age 18 (all set by local boards, national oversight, considered for
     university entrance)
Curriculum control: National, central control (since 1988)

*Wayne Riddle, Congressional Research Service, personal communication, Nov. 26, 1991.



searchers concluded that the reforms ". . . repre-
sented an abrupt acceleration of the otherwise
glacially slow process of transferring authority over
the schools from local to central government."101



England always had a diverse and decentralized 
school system. The great universities and "public"
schools,102 which were closely tied to the Church of
England, existed for the upper classes; there was no
need for selective entrance examinations, given that
student qualifications were not an issue for admis-
sions.103 In the middle of the 19th century, England's
highly decentralized system distinguished it from
other European countries, which already had strong
central curricula and uniform school-leaving exami-
nations. To bring some order to the system, the
British Government instituted the "payment by
results" system. Beginning in 1861, local govern-
ments whose students performed well on a special
national test received extra subsidies. The goal of
this policy was to promote quality in key subject
areas. There was no attempt to create a central
curriculum.104 This testing program was eventually
scuttled because of dissatisfaction with the inequali- 
ties it aggravated. Schools that had the most difficult 



problems were those that suffered most under the 
system; essentially the rich got richer. 

Following World War II, in an effort to democra-
tize secondary school selection procedures, the 11+
examination was developed. These were local exam-
inations, run by local education authorities (LEAs).
The goal was to track students at age 11, according
to ability, as measured on the examination and
according to need. Roughly 20 percent of students
were tracked into grammar school (i.e., the college
preparatory track) and the rest into secondary
"modern" schools. As LEAs introduced compre-
hensive schools in place of the grammar and
secondary moderns, the 11+ was no longer needed.
Although it is still in use in a small number of places
in England and Wales, by and large the 11+ was
dropped during the 1960s and 1970s.

The General Certificate of Secondary Education 
(GCSE) continues the tradition of local control of 
curriculum and testing. Although the concept of
merging the prior "ordinary" examination ("O
levels") and the CSE examinations goes back to
the early 1970s, the first GCSE examinations were
administered in 1988. The GCSE became the single
examination, mirroring the switch from the grammar
and secondary moderns to one comprehensive
school. The GCSE is taken by students at the age of 
16 or earlier. Local groups of teachers and school 
administrators, through the examining boards, intro- 
duce examination topics related to their own syllabi. 
A central School Examinations and Assessment 
Council, established by Parliament, establishes na- 
tional examination criteria to which all GCSE 
syllabi and examinations must conform. Recruit- 
ment into certain jobs and selection into advanced 
training are influenced by the number and quality of 
passing grades on the GCSE. 

More advanced examinations, the "A levels," are
also offered in the upper grades of comprehensive 
school (age 18). Success on at least three A levels 
has become an important criterion for advancement 
to university study. Thus the school-leaving exami- 
nation system in the United Kingdom has evolved 
into a two-tiered examination system. A recent 



¹⁰¹ Noah and Eckstein, op. cit., footnote 44, p. 25. This characterization may be somewhat overstated, given that local management of schools remains
an important component of the school system. Robert Ratcliffe, academic programs officer, The British Council, personal communication, Aug. 15,
1991.

¹⁰² English "public" schools would be called "private" in the American idiom.
¹⁰³ Cummings, op. cit., footnote 43.
¹⁰⁴ Ibid., p. 93.
survey of 16-year-olds in England showed slightly
over one-half planning to continue their education.
About one-third of the country's 16-year-olds, who
achieved grades of A, B, or C (on a scale of A to G)
on five or more of their GCSE examinations, are
most likely to continue. In 1988-89, 22 percent of all
18-year-olds in England passed one or more A-level
examinations; 12 percent, three or more.¹⁰⁵ Students
have, in the past, been able to select their own
subjects for the GCSE and its predecessors and for
the A levels.¹⁰⁶ There is some concern that early
specialization in grades 11 and 12, to prepare for A
levels, is one factor causing many students to
abandon study in mathematics and the sciences at
age 16 in favor of the humanities or social sciences.

The background for the 1988 reform was similar
to the push for educational changes in the United
States: business people were complaining that students
arrived at the workplace lacking basic skills,
while others were troubled by inequalities in teaching
and resources, and by an education system out of
sync with technology. The Conservative Government
under Margaret Thatcher put into place a
reform bill that forced the issue. As the chief
executive of the newly established National Curriculum
Council noted: "The educational establishment,
left to its own, will take a hundred years to buy
a new stick of chalk. . . . In the end, to say: 'It's time
you guys got on with it; here's an act and a crisp
timetable' was probably necessary."¹⁰⁷

First and foremost, the ERA defined a comprehensive
national curriculum for all public school students
ages 5 to 16. These students are to take
foundation subjects: core subjects are English (Welsh,
in Wales), mathematics, and science, plus, for 11- to
16-year-olds, technology (including design), history,
geography, music, art, physical education, and
modern foreign language. Attainment targets set
general objectives and standards for 10 levels
covering the full range of pupils of different abilities
in compulsory education. Average pupils will reach
level two by age 7; each new level represents, on
average, 2 years of progress. The statements of
attainment provide the basis for the assessment
arrangements. Assessment is to take place by
classroom teachers throughout the year, with special
soundings via national tests known as standard
assessment tasks (SATs) given at or near the
completion of each of four "key stages" of teaching
(ages 7, 11, 14, 16).
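
The pacing of the 10 attainment levels can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes a strictly linear reading of the text (level 2 at age 7, one further level for every 2 years), and the function name `expected_level` is hypothetical rather than anything defined in the National Curriculum documents.

```python
def expected_level(age: float) -> float:
    """Expected attainment level for an average pupil at a given age.

    Assumption (illustrative only): level 2 is reached at age 7 and each
    additional level takes exactly 2 years, as a straight-line reading
    of the averages quoted in the text.
    """
    return 2 + (age - 7) / 2


# Print the expected level at the end of each of the four key stages.
for age in (7, 11, 14, 16):
    print(age, expected_level(age))
```

Under this simplified reading, an average pupil would be at about level 4 at age 11 and between levels 6 and 7 at age 16. Actual attainment need not follow a straight line.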

The assessments are meant to serve multiple 
purposes: 

. . . formative, providing information teachers can
use in deciding how a pupil's learning should be
taken forward, and in giving the pupils themselves
clear and understandable targets and feedback about
their achievements; summative, providing overall
evidence of the achievements of a pupil and of what
he or she knows, understands and can do; evaluative,
providing comparative aggregated information about
pupils' achievements as an indicator of where there
needs to be further effort, resources, change in the
curriculum; and informative, helping communication
with parents about how their child is doing
and with governing bodies, LEAs and the wider
community about the achievements of a school.¹⁰⁸

The objective is to keep the schools working
within a national framework but with local discretion
in implementing the curriculum. As parents can
now send children to any school they choose, it is
anticipated that parents will compare published
examination results of schools, and thus schools will
try to raise standards to attract more pupils.¹⁰⁹ But
there is concern that comparisons may mask differing
social and economic levels of students, and that
problems associated with the "payment by results"
approach of 100 years ago could return. Teachers
also feel overwhelmed by the requirements of the
program: the double system of assessment at key
stages (the SATs as well as continuous
assessment in the classroom) means that British
school children will soon be the most assessed in
Europe.

The program is being implemented at the primary
level in the spring of 1991 and will be phased in over
the next 3 years. Secondary students may be
assessed through GCSEs or according to National
Curriculum assessments at age 16. GCSE criteria



¹⁰⁵ National Endowment for the Humanities, op. cit., footnote 37, p. 45.

¹⁰⁶ Few schools allowed students to omit mathematics and English for the General Certificate of Secondary Education and its predecessors, but rules
about what must be studied at this level will become tighter under the national curriculum assessment. Nuttal, op. cit., footnote 14.

¹⁰⁷ Tim Brookes, "A Lesson to Us All," The Atlantic, vol. 267, No. 5, May 1991, p. 28.

¹⁰⁸ Department of Education and Science, National Curriculum: From Policy to Practice (Stanmore, England: 1989), p. 6.
¹⁰⁹ Brookes, op. cit., footnote 107.




and syllabi will be brought into line with the
statutory requirements for attainment targets, programs
of study, and assessment strategies, but the
relationship between the National Curriculum's 10 levels
of attainment and the GCSE grades has yet to be
determined.¹¹⁰ In early 1991, plans were announced
to require all students to take GCSEs in the three
core subjects of English (or Welsh), mathematics,
and science. The study of either history or geography,
technology, and a modern foreign language is
also compulsory to age 16. Students can choose
whether to have their competence in these and other
subjects assessed by GCSE examinations.¹¹¹

The SATs are one of the most interesting features 
of the program, and the feature most likely to 
influence curriculum. As in most European testing 
programs, the SATs have only open-ended ques- 
tions. Many innovative testing approaches were 
developed for an earlier comprehensive assessment 
England embarked on in 1975.¹¹² These innovative
test items and formats are the basis for many of the 
performance testing items that are to become the 
backbone of the SATs and classroom assessment 
procedures under the new program. 

A nationally representative sample of students at
ages 11, 13, and 15 was tested in a survey similar
to NAEP. The 1975 goal was to assess student
achievement and knowledge in four
areas: mathematics, language, science, and foreign
languages.

Mathematical abilities were tested in several
formats, including 50 short-response items drawn
from a total of 700 test items in each survey. A
subsample of students in each age group was given
written tests of problem-solving skills; another
subsample of 1,200 students in each age group was
given oral tests of problem-solving tasks. The
mother language survey assessed reading, writing,
and "oracy," a term coined for its analogy to
literacy as a measure of the ability to communicate
effectively in a spoken as opposed to written
medium. The science assessments were made up of
individual and small group tasks emphasizing practical
skills performed at a number of "stations."
Foreign language testing used oral and written
testing formats.

The program led to the evolution and application
of innovative techniques to assess student performance,
such as mathematical skills in a practical
context, especially for students whose mathematical
abilities were masked by reading difficulties; written and
spoken skills in the mother tongue and in foreign
languages; and practical assessments in science.



¹¹⁰ Department of Education and Science, op. cit., footnote 108, paragraph 6.7.
¹¹¹ National Endowment for the Humanities, op. cit., footnote 37, p. 45.

¹¹² Clare Burstall, "Innovative Forms of Assessment: A United Kingdom Perspective," Educational Measurement: Issues and Practice, vol. 5, No. ...,
spring 1986.
CHAPTER 6

Standardized Tests in Schools: A Primer 



Highlights 

A test is an objective and standardized method for estimating behavior, based on a sample of that
behavior. A standardized test is one that uses uniform procedures for administration and scoring in order
to assure that results from different people are comparable. Any kind of test, from multiple choice to
essays to oral examinations, can be standardized if uniform scoring and administration are used.

Achievement tests are the most widely used tests in schools. Achievement tests are designed to
assess what a student knows and can do as a result of schooling. Among standardized achievement tests,
multiple-choice formats predominate because they are efficient, easily administered, broad in their
coverage, and can be machine scored.

Advances in test design and technology have made American standardized achievement tests
remarkably sophisticated, reliable, and precise. However, misuse of tests and misconceptions about what
test scores mean are common.

Tests are often used for purposes for which they have not been designed. Tests must be designed
and validated for a specific function, and use of a test should be limited to only those functions. Once tests
are in the public domain, misuse or misinterpretation of test results is not easy to control or change.

Because test scores are estimates and can vary for reasons that have nothing to do with student
achievement, the results of a single test should never be used as the sole criterion for making important
decisions about individuals. A test must meet high standards of reliability and validity before it is used
for any "high-stakes" decisions.

The kind of information policymakers and school authorities need to monitor school systems is very
different from the kind teachers need to guide instruction. Relatively few standardized tests fulfill the
classroom needs of teachers.

Existing standardized norm-referenced tests primarily test basic skills. This is because they are
"generic" tests designed to be used in schools throughout the Nation, and basic skills are most common
to all curricula.

Current disaffection with existing standardized achievement tests rests largely on three features of
these tests: 1) most are norm-referenced and thus compare students to one another, 2) most are multiple
choice, and 3) their content does not adequately represent local curricula, especially thinking and
reasoning skills. This disaffection is driving efforts among educators and test developers to broaden the
format of standardized tests. They seek to design tests more closely matched to local curricula, and to
design tests that best serve the various functions of educational testing.

Changing the format of tests will not, by itself, ensure that tests are better measures of desired goals
nor will it eliminate problems of bias, reliability, and validity. In part because of these technical and
administrative concerns, test developers are exploring ways to improve multiple-choice formats to
measure complex thinking skills better. As new tests are designed, new safeguards will be needed to
ensure they are not misused.



How Do Schools Test? 

Nearly every type of available test designed for
use with children is used in schools. Tests of
personality, intelligence, aptitude, speech, sensory
acuity, and perceptual motor skill, all of which have
applications in nonschool settings as well, are used
by trained personnel such as guidance counselors,
speech-language specialists, and school psychologists.
Certain tests, however, have been designed
specifically for use in educational settings. These
Figure 6-1—Tests Used With Children

[Tree diagram. Recoverable labels, by level:]

Other tests
    Intelligence/aptitude
    Personality
    Developmental scales for infants
    Speech/oral language
    Motor proficiency
    Medical
    Sensory acuity (e.g., vision, hearing)
    Driver's license exam
    Auditions (performing arts)
    Athletic try-outs and competitions

Educational achievement tests
    Nonstandardized
    Standardized
        Criterion-referenced
        Norm-referenced

Test and item formats (final level of the tree)
    • Multiple-choice
    • True-false
    • Constructed-response
    • Essays
    • Oral exams
    • Exhibitions
    • Experiments
    • Portfolios

Labels attached to the format list: "Most existing standardized achievement tests"; "Performance assessment."

SOURCE: Office of Technology Assessment, 1992; adapted from F.L. Finch, "Toward a Definition for Educational
Performance Assessment," paper presented at the ERIC/PDK Symposium, August 1990.

tests, commonly referred to as achievement tests, are
designed to assess student learning in school subject 
areas. They are also the most frequently used tests in 
elementary and secondary school settings; with few 
exceptions all students take achievement tests at 
multiple points in their educational careers. Educa- 
tional achievement tests are the primary focus of this 
report. 

Figure 6-1 shows the distinction between educa- 
tional achievement tests and the other kinds of tests. 
Achievement tests are designed to assess what a 
student knows and can do in a specific subject area 
as a result of instruction or schooling. Achievement
test results are designed to indicate a student's 
degree of success in past learning activity. Achieve- 
ment tests are sometimes contrasted with aptitude 
tests, which are designed to predict what a person 
can be expected to accomplish with training (see box 
6-A). 

Achievement tests include a wide range of types
of tests, from those designed by individual teachers
to those designed by commercial test publishing
companies. Examples of the kinds of tests teachers
design and use include a weekly spelling test, a final
essay examination in history, or a laboratory examination
in biology. At the other end of the achievement
test spectrum are tests designed outside the
school system itself and administered only once or
twice a year; examples of this include the familiar
multiple-choice, paper-and-pencil tests that might
cover reading, language arts, mathematics, and
social studies (see box 6-B).

The first important distinction when talking about
achievement tests is between standardized and
nonstandardized tests (see figure 6-1 again).¹ A
standardized test uses uniform procedures for administering
and scoring. This assures that scores
obtained by different people are comparable to one
another. Because of this, tests that are not standardized
have limited practical usefulness outside of the
classroom. Most teacher-developed tests or "back-of-the-book"
tests found in textbooks would be consid-



¹ Fredrick L. Finch, The Riverside Publishing Co., "Toward a Definition for Educational Performance Assessment," paper presented at the
ERIC/PDK Symposium, 1990.
Photo credit: Dmk QtiHomy

Standardized achievement tests are often administered to many students at the same sitting. Standardization means that
tests are administered and scored under the same conditions for all students and ensures that results are
comparable across classrooms and schools.
Box 6-A—Achievement and Aptitude Tests: What is the Difference?

Attempts to measure learning as a result of schooling (achievement) and attempts to measure aptitude
(including intelligence) each have different, yet intertwined, histories (see ch. 4). Intelligence testing, with its strong
psychometric and scientific emphasis, has influenced the design of achievement tests in this country. Achievement
tests are generally distinguished from aptitude tests in the degree to which they are explicitly tied to a course of
schooling. In the absence of common national educational goals, the need

by any student has resulted in tests more remote from specific curricula than tests developed close to the classroom.
The degree of difference can be subtle and the test's title is not always a reliable guide.

A test producer's claims for an achievement test or an aptitude test do not mean
circumstances with all pupils.¹

There clearly is overlap between a pupil's measured ability and achievement, and perhaps the final answer to the
question of whether any test assesses a pupil's achievement or a more general underlying trait such as verbal ability
rests with the local user, who knows the student and the curriculum he or she

The farther removed a test is from the specific educational curricula that has been delivered to the test taker, the more
that test is likely to resemble a measure of aptitude instead of achievement for that student.

Whenever tests are going to be used for policy decisions about the effectiveness of education, it is important
to assure that those tests are measuring achievement, not ability; inferences about school effectiveness must be
directly tied to what the school actually delivers in the classroom, not to what children already bring to the
classroom. Accordingly, tests designated for accountability should be shown to be sensitive to the effects of
school-related instruction.³

To understand better the distinctions currently made between achievement and aptitude tests, it is helpful to
turn to one of the "pillars of assessment development,"⁴ Anne Anastasi:

Surpassing all other types of standardized tests in sheer number, achievement
effects of a specific program of instruction or training. It has been customary to contrast achievement tests with



¹ Eric Gardner, "Some Aspects of the Use and Misuse of Standardized Aptitude and Achievement Tests," Ability Testing: Uses,
Consequences, and Controversies, part 2, Alexandra K. Wigdor and Wendell R. Garner (eds.) (Washington, DC: National Academy Press, 1982),
p. 325.

² Peter W. Airasian, "Review of Iowa Tests of Basic Skills, Forms 7 and 8," The Ninth Mental Measurements Yearbook, vol. I, James
V. Mitchell, Jr. (ed.) (Lincoln, NE: The University of Nebraska Press, 1983), p. 720.

³ No achievement test, though, will measure only
his or her cumulative experiences. "No test reveals how or why the individual reached that level." Anne Anastasi, Psychological Testing (New
York, NY: Macmillan Publishing Co., 1988), p. 413.

⁴ Carol Schneider Lidz, "Historical Perspectives," Dynamic Assessment: An Interactional Approach to Evaluating Learning Potential,
C.S. Lidz (ed.) (New York, NY: Guilford, 1987), pp. 3-32.



ered nonstandardized. Although these tests may be
useful to the individual teacher, scores obtained by
students on these tests would not be comparable
across classrooms, schools, or different points in
time, because the administration and scoring are
not standardized.

Thus, contrary to popular understanding, "standardized"
does not mean norm-referenced nor does it
mean multiple choice. As the tree diagram in figure
6-1 illustrates, standardized tests can take many
different forms. All achievement tests intended for
widespread use in decisions comparing children,
schools, and districts should be standardized. Lack
of standardization severely limits the inferences and
conclusions that can be made on the basis of test
results. A test can be more or less standardized (there
is no absolute criterion or yardstick to denote when
a test has "achieved" standardization); as a result,
teacher-developed tests can incorporate features of
standardization that will permit inferences to be
made with more confidence.

Most existing standardized tests can be divided
into two primary types based on the reference point
for score comparison: norm-referenced and
criterion-referenced.

Norm-referenced tests help compare one student's
performance with the performances of a large
group of students. Norm-referenced tests are de-
aptitude tests, the latter including general intelligence tests, multiple aptitude batteries, and special aptitude tests.

From one point of view, the difference between achievement and aptitude testing is a difference in the degree of
uniformity of relevant antecedent experience. Thus achievement tests measure the effects of relatively standardized
sets of experiences, such as a course in elementary French, trigonometry,

Aptitude test performance reflects the cumulative influence of a multiplicity

say that aptitude tests measure the effects of learning under relatively uncontrolled

achievement tests measure the effects of learning that occurred under partially known and controlled conditions.

A second distinction between aptitude and achievement tests pertains to their respective uses. Aptitude tests
serve to predict subsequent performance. They are employed to estimate the extent to which the individual will profit
from a specified course of training, or to forecast the quality of his or her achievement in a new situation. Achievement
tests, on the other hand, generally represent a terminal evaluation of the individual's status on the completion of
training. The emphasis on such tests is on what the individual can do at the time.⁵

Although in the early days of psychological testing aptitude tests were thought to measure "innate capacity"
(unrelated to schooling, experience, or background), while achievement tests were thought to measure learning, this
is now considered a misconception.⁶ Any test score will reflect a combination of school learning, prior experience,
ability, individual characteristics (e.g., motivation), and opportunities to learn outside of school. Aptitude and
achievement tests differ primarily in the extent to which the test content is directly affected by school experiences.

... aptitude tests, particularly IQ tests, came under increasing scrutiny and criticism. A highly
political debate, set off by Arthur Jensen's controversial analysis of the heritability of racial differences in
intelligence, thrust IQ tests into the limelight. Similarly, the late 1960s and early 1970s saw several significant court
challenges to the use of IQ tests in ability tracking. Probably because of these controversies, as well as increased
understanding of the limitations of intelligence tests, many large school systems have moved away from using
aptitude tests as components of their basic testing programs.⁷ These tests are still widely marketed, however, and
their use in combination with achievement tests is often promoted.

Achievement and aptitude tests differ, but the distinctions between the two in terms of design and use are often
blurred. For policy purposes, the essential point is this: even though a test may be defined as an achievement test,
the more it moves away from items tied to specific curriculum content and toward items that assess broader concepts
and skills, the more the test will function as an aptitude test. Should a national test be constructed in the absence
of national standards or curriculum, it is therefore likely to be essentially an aptitude test. Such a test will not
effectively reflect the results of schooling.



⁵ Anastasi, op. cit., footnote 3, pp. 411-414.

⁷ ... Program Used in Major School Systems Throughout the United States in the School Year 1977-78 (Akron,
OH: Akron Public Schools Division of Personnel and Administration, 1978).



signed to make fine distinctions between students'
performances and accurately pinpoint where a student
stands in relation to a large group of students.²
These tests are designed to rank students along a
continuum.

Because of the complexities involved in obtaining
nationally representative norms, norm-referenced
tests (NRTs) are usually developed by commercial
test-publishing companies who administer the test to
large numbers of school children representative of
the Nation's student population (see box 6-C). The
score of each student who takes that test can be
compared to the performance of other children in the
standardization sample. Typically a single NRT is
used by many schools and districts throughout the
country.³

Criterion-referenced tests (CRTs) are focused on 
"... what test takers can do and what they know, not 



² Lawrence Rudner, Jane Close Conoley, and Barbara S. Plake (eds.), Understanding Achievement Tests (Washington, DC: ERIC Clearinghouse on
Tests, Measurement, and Evaluation, 1989), p. 10.

³ Many publishers offer district-level norms as well. Several publishers now create custom-developed norm-referenced tests that are based on local
curricular objectives, yet come with national norms. These norms, however, are only valid under certain circumstances. See ibid.
Box 6-B—Types of Standardized Achievement Tests

Currently available standardized achievement tests are likely to be one of four types.¹ The best known and most
widely used type is the broad general survey achievement battery. These tests are used across the entire age range
from kindergarten through adult, but are most widely used in elementary school
academic areas such as reading, language, mathematics, and sometimes science and
commercially developed, norm-referenced, multiple-choice tests. Examples include the Comprehensive Test of
Basic Skills, the Metropolitan Achievement Test, and the Iowa Tests of Basic Skills (ITBS). In addition, many test
publishers now offer essay tests of writing that can accompany a survey achievement test.

In the 1989-90 school year, commercially published, off-the-shelf, achievement battery tests were a mandated
feature of testing programs in about two-thirds of the States and the District of Columbia (see figure 6-B1). Five
of those States required districts to select a commercial achievement test from a list of approved tests, while 27
specified a particular test to be administered. In addition, many districts require a norm-referenced test (NRT), even
if the State does not. A survey of all districts in Pennsylvania, which does not mandate use of an NRT, found that
91 percent of the districts used a commercial off-the-shelf NRT.²

The second type of test is the test of minimum competency in basic skills. These tests are usually
criterion-referenced and are used for certifying attainment and/or awarding a high school diploma. They are most
often used in secondary school and are usually developed by the State or district.³

Far less frequently available as commercially published, standardized tests, the third category includes
achievement tests in separate content areas. The best known examples of these are the Advanced Placement
examinations administered by the College Board, used to test mastery of specific subjects such as history or biology
at the end of high school for the purpose of obtaining college credit.

The final type of achievement test is the diagnostic battery. These tests differ from the survey achievement
battery primarily in their specificity and depth; diagnostic tests have a more narrowly defined focus and concentrate
on specific content knowledge and skills. They are generally designed to describe an individual's strengths and
weaknesses within a subject matter area and to suggest reasons for difficulties. Most published diagnostic tests cover
either reading or mathematics. Many of the diagnostic achievement tests need to be individually administered by
a trained examiner and are used in special education screening and diagnosis.



¹ This discussion of the four types of achievement tests is drawn from Anne Anastasi, Psychological Testing (New York, NY: Macmillan
Publishing Co., 1988).

² Ross S. Blust and Richard L. Kohr, Pennsylvania Department of Education, "Pennsylvania School District Testing Programs," ERIC
Document ED 269 409, TM 840.300, January 1984.

³ See ch. 2 for a discussion of uses of minimum competency tests.



how they compare to others."⁴ CRTs usually report
how a student is doing relative to specified educational
goals or objectives. For example, a CRT score
might describe which arithmetic operations a student
can perform or the level of reading difficulty he
or she can comprehend. Some of the earliest
criterion-referenced scales were attempts to judge a
student's mastery of school-related skills such as
penmanship. Figure 6-2 illustrates one such scale,
developed in 1910 by E.L. Thorndike to measure
handwriting. The figure shows some of the sample
specimens against which a student's handwriting
could be judged and scored.

Most certification examinations are criterion- 
referenced. The skills one needs to know to be 
certified as a pilot, for example, are clearly spelled 
out and criteria by which mastery is achieved are 
described. Aspiring pilots then know which skills to
work on. Eventually a pilot will be certified to fly not 
because she or he can perform these skills better than 
most classmates, but because knowledge and mas- 
tery of all important skills have been demonstrated. 






⁴ Anne Anastasi, Psychological Testing (New York, NY: Macmillan Publishing Co., 1988), p. 102. The term "criterion-referenced test" is being
used here in its broadest sense and includes other terms such as content-, domain-, and objective-referenced tests.
Figure 6-B1—State Requirements: Commercial Norm-Referenced
Achievement Tests, 1990

Legend:
    States that require a commercial off-the-shelf norm-referenced achievement test (NRT), n=27.
    States that require districts to select off-the-shelf NRTs from approved list, n=5.
    State testing programs that do not require off-the-shelf NRTs, n=14.
    No State mandated testing program, n=4.

NOTE: Kentucky and Arizona are currently changing their norm-referenced test (NRT) requirements (see ch. 7).
Although Iowa has no State testing requirements, 95 percent of its districts administer a commercial NRT.

SOURCE: Office of Technology Assessment, 1992.



Such tests will usually have designated cutoff 
scores or proficiency levels above which a student 
must score to pass the test. 
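
The difference between the two reference points can be shown with a single raw score read both ways. The sketch below is illustrative only: the scores, the cutoff, and the function names `percentile_rank` and `meets_criterion` are invented for the example, and "percent of the norm group scoring below" is just one common way of defining a percentile rank.

```python
from bisect import bisect_left


def percentile_rank(score, norm_scores):
    """Norm-referenced reading: percent of the norm group scoring below `score`."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)  # count of norm scores strictly below
    return 100.0 * below / len(ordered)


def meets_criterion(score, cutoff):
    """Criterion-referenced reading: has the test taker reached the cutoff?

    Assumption: a score equal to the cutoff counts as passing.
    """
    return score >= cutoff


# Hypothetical standardization sample and student score (made-up numbers).
norm_sample = [12, 15, 18, 20, 22, 25, 27, 30, 33, 36]
student_score = 27

print(percentile_rank(student_score, norm_sample))  # standing relative to others
print(meets_criterion(student_score, cutoff=30))    # standing relative to a fixed standard
```

With these made-up numbers the student outscores 60 percent of the norm group yet falls short of the cutoff of 30, which is why the two kinds of score answer different questions.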

Another component of a standardized achievement
test that warrants careful scrutiny is the format
of the test, the kind of items or tasks used to
demonstrate student skills and knowledge. The final
level in figure 6-1 depicts the range of testing
formats. Almost all group-administered standardized
achievement tests are now made up of multiple-choice
items⁵ (see box 6-D). Currently, educators
and test developers are examining ways to use a
broader range of formats in standardized achievement
tests. Most of these tasks, which range from
essays to portfolios to oral examinations, are labelled
"performance assessment" and are described
in the next chapter.

Creating a Standardized Test: 
Concern for Consistency 
and Accuracy 

The construction of a good test is an attempt to
make a set of systematic observations in an accurate
and equitable manner. In the time period since
Binet's pioneering efforts in the empirical design of


⁵ A number of commercially developed achievement tests have added optional direct sample writing tasks.
Box 6-C—How a Standardized Norm-Referenced Achievement Test is Developed¹

Step 1—Specify general purpose of the test
Step 2—Develop test specifications or blueprint

• Identify the content that the test will cover: for achievement tests this means specifying both the subject
matter and the behavioral objectives.

• Conduct a curriculum analysis by reviewing current texts, curricular guidelines, and research and by
consulting experts in the subject areas and skills selected. Through this process a consensus definition of
important content and skills is established, ensuring that the content is valid.

Step 3—Write items

• Often done by teams of professional item writers and sub

• Many more items are written than will appear on the test.

• Items are reviewed for racial, ethnic, and sex bias by outside teams of professionals.

Step 4—Pretest items

• Preliminary versions of the items are tried out on large, representative samples of children. These samples
must include children of all ages, geographic regions, ethnic groups, and so forth with whom the test will
eventually be used.

Step 5—Analyze items

• Statistical information collected for each item includes measures of item difficulty, item discrimination, age
differences in easiness, and analysis of incorrect responses.

Step 6—Locate standardization sample and conduct testing

• To obtain a nationally representative sample, publishers select students

characteristics, including those for individual pupils (e.g., age and sex), school systems (e.g., public,
parochial, or private) and communities (e.g., geographical regions or urban-rural-suburban).

• Most publishers administer two forms of a test at two different
standardization.

Step 7—Analyze standardization data, produce norms, analyze reliability and validity evidence

• Alternate forms are statistically equated to one another.

• Special norms (e.g., for urban or rural schools) are often prepared.

Step 8—Publish test and test manuals

• Score reporting materials and guidelines are designed.

¹ Adapted from Anthony J. Nitko, Educational Tests and Measurement: An Introduction (New York, NY: Harcourt Brace Jovanovich,
1983), pp. 468-476.



tests, considerable research effort has been ex-
pended to develop theories of measurement and
statistical procedures for test construction. The
science of test design, called psychometrics, has
contributed important principles of test design and
use. However, a test can be designed by anyone with
a theory or a view to promote: witness the large
number of "tests" of personality type, social IQ,
attitude preference, health habits, and so forth that
appear in popular magazines. Few mechanisms
currently exist for monitoring the quality, accuracy,
or credibility of tests. (See ch. 2 for further discus-
sion of the issues of standards for tests, mechanisms



for monitoring test use, and protections for test 
takers.) 
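
As one illustration of the statistical procedures psychometrics supplies, the item statistics named in step 5 of box 6-C (item difficulty and item discrimination) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a publisher's actual procedure: the function name `item_statistics`, the 0/1 scoring, and the use of a simple upper-half versus lower-half discrimination index are all choices made here for illustration.

```python
def item_statistics(score_matrix):
    """Return (difficulty, discrimination) for each item.

    score_matrix: one row per student, one 0/1 entry per item
    (1 = correct). Hypothetical layout chosen for this sketch.
    Requires at least two students.
    """
    n_students = len(score_matrix)
    n_items = len(score_matrix[0])

    # Rank students by total score, then split into lower and upper halves.
    totals = [sum(row) for row in score_matrix]
    order = sorted(range(n_students), key=lambda i: totals[i])
    half = n_students // 2
    low, high = order[:half], order[-half:]

    stats = []
    for j in range(n_items):
        # Difficulty: proportion of all students answering the item correctly.
        difficulty = sum(row[j] for row in score_matrix) / n_students
        # Discrimination: proportion correct in the upper half minus
        # proportion correct in the lower half.
        p_high = sum(score_matrix[i][j] for i in high) / half
        p_low = sum(score_matrix[i][j] for i in low) / half
        stats.append((difficulty, p_high - p_low))
    return stats
```

An item that nearly everyone answers correctly (difficulty near 1.0) or that high and low scorers answer equally well (discrimination near 0) contributes little to separating students, which is one reason many more items are written than appear on the final test.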

How good is a test? Does it do the things it 
promises? What inferences and conclusions can be 
drawn from the scores? Does the test really work? 
These are difficult questions to answer and should 
not be determined by impressions, judgment, or 
appearances. Empirical information about the per- 
formance of large numbers of students on any given 
test is needed to evaluate its effectiveness and 
merits. This section addresses the principal methods 
used to evaluate the technical quality of tests. It 



See ch. 4.

Figure 6-2: Thorndike's Scale for Measuring Handwriting

[Handwriting specimens shown for Quality 18, 17, 14, 9, 5, and 4.]

NOTE: A series of handwriting specimens were scaled on a numerical
"quality" scale. To use the scale, a student's sample of writing is
matched to the quality of one of the specimens and assigned the
given numerical value. This figure shows only some of the
specimens.

SOURCE: Anthony J. Nitko, Educational Tests and Measurement: An
Introduction (New York, NY: Harcourt Brace Jovanovich, 1983),
p. 450.



begins by dissecting the basic definition of a test and 
then examines concepts of reliability and validity. 

What is a Standardized Test? 

This type of test is an objective and standardized
method for estimating behavior based on obtaining
a sample of that behavior. There are four key
elements of this definition.



Sample of Behavior 

Not all of an individual's behavior relevant to a 
given topic can be observed. Just as a biochemist 
must take representative samples of the water supply 
to assess its overall quality, a test obtains samples of 
behavior in order to estimate something about an 
individual's overall proficiency or skill level with 
respect to that behavior. Thus, to estimate a student's
skill at arithmetic computations, a test might provide 
a number of problems of varying complexity drawn 
from each of the areas of addition, subtraction, 
multiplication, and division. The samples chosen 
must be sufficiently broad to represent the skill 
being tested. For example, performance on five long 
division problems would not provide an adequate 
estimate of overall computational skill. Similarly, a 
behind-the-wheel driving test that consists only of 
parking skills (parallel parking, backing into a 
space) would hardly constitute a valid indicator of a 
driver's overall competence.

Estimation 

Precisely because much of human behavior is 
variable and because a person's knowledge and 
thinking cannot be directly observed, scores ob- 
tained on any educational test should always be 
viewed as estimates of an individual's competence. 
In general, the accuracy of estimates generated by 
tests will be enhanced when technical procedures are 
used to design, field test, and modify tests during
development. 

Standardization 

Standardization refers to the use of a uniform
procedure for administering and scoring the test. 
Controlling the conditions under which a test is 
given and scored is necessary to ensure comparabil- 
ity of scores across test takers. Each student is given 
identical instructions, materials, practice items, and 
amount of time to complete the test. This procedure 
can reduce the effects of extraneous variables on a 
student's score. Similarly, procedures for scoring 
need to be uniform for all students. 

Objectivity 

Objectivity in test construction is achieved by 
eliminating, or reducing as much as possible, the
amount of subjective judgment involved in develop- 



The word "behavior" is used here in its broadest sense and includes more specific constructs such as knowledge, skills, traits, and abilities. This
discussion of the components of the definition of a test is drawn from Anastasi, op. cit., footnote 4.
Box 6-D: Large-Scale Testing Programs: Constraints on the Design of Tests

The demand for standardized tests of achievement is driven by the need to collect comparable achievement
data about large numbers of students, schools, and districts. Tests are required that can be given to a large number
of students simultaneously and in many school districts. Because of this, and more so than for most other kinds
of tests, the technology of standardized achievement testing reflects the practical considerations of economy,
efficiency, and limits on the amount of time that can be devoted to test taking. The need for efficiency and
economy has affected the design of standardized achievement testing in at least three important ways, each of
which requires some tradeoff in the information obtained.

Group administration: Most standardized achievement tests are group administered; large numbers of
students take the test at the same sitting with no guidance by an examiner. Many other types of standardized tests
(e.g., personality, speech, and visual-motor skills) are individually administered by trained examiners who can
ensure systematic administration and scoring of results. While far more labor intensive and time consuming,
individual examiners can make observations of the student that provide a rich source of supplementary
information. Individually administered tests can also be tailored to the level of knowledge demonstrated by the
child and thus can cover a number of content areas in some detail without becoming too long or frustrating for
the child.

Machine scored: Most standardized achievement tests are scored by machine, because of the numbers of
tests to be scored quickly and economically. This need restricts the format for student answers. Most
machine-scored tests are made up of items on which students recognize or select a correct response (e.g., multiple
choice or true-false) rather than create an answer of their own.

Broad, general content: The content of tests designed to be administered to all students will be broad and
general when testing time is limited. The requirement that an achievement test can be taken by students of all
skill levels in a given age group means that for every content area covered by the test, many items must be
administered, ranging from low to high levels of difficulty. Most students will spend time answering extra
items (some too difficult, some too easy) in order to accommodate all test takers.

Constraints

The design of standardized achievement tests for use with all students in a school system is therefore
constrained by three factors: 1) the amount of testing time available, which constrains test length, 2) the costs of
test administration and scoring, and 3) the logistical constraints imposed by the large numbers of tests that must
be administered and scored quickly. However, the tension between the economy and efficiency needs, and the
desire for rich, individualized information, underlies much of the current testing debate.

Three major areas of technological development offer promise for expanding the range of possibilities for
large-scale standardized achievement tests.

Machine scoring: As the technology advances, machines and computers may be able to score more
complex and sophisticated responses by students (see ch. 7).

Individual administration via computer: The computer has considerable potential as a method for
harnessing many of the important advantages of individualized test administration. These include the capability
to adapt test items to match the proficiency of the student (allowing more detailed assessments in short time
periods), and to record steps taken by the test taker. In essence, the computer may be able to replace some of
the important but expensive functions previously served by a trained testing examiner (see ch. 8).

Sampling designs: The technology of sampling, by which generalizable conclusions can be made based
on testing of far fewer numbers of students, is an important development as well. The effectiveness of testing
subgroups of children, or testing all children on a portion of the test, has been well demonstrated. This sampling
methodology offers a practical avenue for trying some more expensive and logistically complex testing
procedures, as every student in a system does not need to take the whole test.

SOURCE: Office of Technology Assessment, 1992.
ing, administering, and scoring the test. The goal of
these procedures is to ensure that an individual
receives a score that reflects his or her level of
understanding and not the particular views or
attitudes of persons administering or scoring the test.
Thus, in theory an objective test is one on which the
test taker will receive the same score regardless of
who is involved in administering that test.

Reliability of Test Scores

As used with respect to testing, reliability refers to
the consistency of scores. If the goal is to estimate a
child's level of mathematics achievement, then the
test should produce consistent results, no matter who 
gives the test or when it is given. If, at the end of 3rd 
grade, a student scores at the 90th percentile on 
Monday in mathematics achievement, but the 40th 
percentile when retested on Friday, neither score 
would instill much confidence. Scores can be 
inconsistent for a number of reasons: behavior varies 
from moment to moment, the content of a test varies, 
or the persons or procedures involved in scoring are 
variable. 

The theoretical ideal for score reliability is 100 
percent. In practice, though, it is impossible for an 
instrument that is calibrating human behavior to 
achieve this level of consistency. Any data from tests 
of human behavior contain some ''noise'' or error 
component that is irrelevant to the purpose of the 
test. The control of testing conditions through 
specification of procedures can reduce the variance 
in scores due to these irrelevant factors, and make
the test a more reliable indicator. However, because 
no test is perfectly accurate and consistent, it should 
be accompanied by evidence of reliability. (When 
public opinion polls are reported, for example, they 
are usually accompanied by statements that indicate 
how much the given figures might be expected to 
vary, e.g., "this number might be expected to vary
4 points up or down." This statement provides
information about the reliability of the poll esti-
mates.)



As tests are currently designed, there are three
principal ways to conceptualize the reliability of test
scores. Estimates of reliability can be obtained by
examining the consistency of a test administered
across different occasions. To what extent do scores
obtained on one day agree with those obtained on a
different day? This form of reliability is called
stability. Secondly, consistency across content, ei-
ther of different groups of items or forms of a test,
can be examined. To what extent does performance
on one group of subtraction items agree with
performance on a second group of subtraction items
intended to assess the same set of skills? This form
of reliability can be assessed by alternate test forms
or by indices of internal consistency. Finally, the
extent to which consistent test scores will be
produced by different raters can be assessed. To
what extent do the scores assigned by one judge
reading an essay test and using a set of designated
rating criteria agree with those given by another
judge using the same criteria? Indices of inter-rater
reliability are used to assess such agreement.
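
The three kinds of consistency described above are each commonly summarized with a coefficient. The following is a minimal sketch, not a procedure taken from this report: it assumes that stability and inter-rater agreement are expressed as a Pearson correlation between two sets of scores, and that internal consistency is expressed with Cronbach's alpha (one widely used index, named here as an assumption). The function names and data layout are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores.

    Usable for stability (day 1 vs. day 2 scores of the same students)
    or inter-rater agreement (judge A vs. judge B on the same essays).
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cronbach_alpha(score_matrix):
    """Internal-consistency estimate for one test administration.

    score_matrix: one row per student, one column per item.
    Assumes at least two items and some variation in total scores
    (otherwise the total variance is zero and alpha is undefined).
    """
    k = len(score_matrix[0])
    item_vars = [pvariance([row[j] for row in score_matrix]) for j in range(k)]
    total_var = pvariance([sum(row) for row in score_matrix])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

In use, `pearson_r(monday_scores, friday_scores)` would estimate stability, `pearson_r(judge_a, judge_b)` would estimate inter-rater reliability, and `cronbach_alpha(item_scores)` would estimate consistency across items; all variable names here are placeholders.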

Reliability is partly a function of test length. As a 
rule, the more items a test contains, the more reliable 
that test will be. As the number of items, or samples,
incorporated in a score increases, the stability of that 
score will also increase. The effect of chance 
differences among items, as well as the impact of a 
single item on the total score, is reduced as a test gets 
longer. This is one of the reasons that multiple- 
choice and other short answer tests tend to be very 
reliable and consistent — many items can be an- 
swered in a short amount of testing time. As will be 
discussed in chapter 7, reliability of scores based on 
fewer and longer tasks is one of the important
challenges faced by the developers of new perform- 
ance assessments. 
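
The relation between test length and reliability can be made concrete with the Spearman-Brown formula from classical test theory. The report does not cite this formula; it is brought in here as an assumption, and it holds only if the added items are comparable in quality to the existing ones. The function name is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability after changing test length.

    reliability:   reliability coefficient of the current test (0 to 1)
    length_factor: new length divided by current length
                   (2.0 = twice as many comparable items, 0.5 = half)
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)
```

Under these assumptions, doubling a test whose reliability is 0.50 gives `spearman_brown(0.50, 2)`, or about 0.67, while halving it lowers the projected reliability. This is the arithmetic behind the challenge noted above for assessments built from a few long tasks.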

Reliability is particularly important when test 
scores are used to make significant decisions about 
individual students. Recall that any one test score is 
considered to be only an estimate of the person's 
"true" proficiency; this score is expected to vary
somewhat from day to day. Reliability coefficients,



While scoring of certain tests can be made almost perfectly objective by use of machine-scoring technologies (see ch. 8), the writing of test questions,
as well as the specification of what will be on the test and which is the right answer, remains a fundamentally subjective activity requiring a great deal
of human judgment.

The discussion of reliability and validity draws on Anastasi, op. cit., footnote 4; Anthony J. Nitko, Educational Tests and Measurement: An
Introduction (New York, NY: Harcourt Brace Jovanovich, 1983); William A. Mehrens and Irvin J. Lehmann, Measurement and Evaluation in Education
and Psychology, 3rd ed. (New York, NY: CBS College Publishing, 1984); and American Educational Research Association, American Psychological
Association, and National Council on Measurement in Education, Standards for Educational and Psychological Testing (Washington, DC: American
Psychological Association, Inc., 1985).
Box 6-E: Test Score Reliability: How Accurate is the Estimate?

All test scores are estimates of proficiency. "Reliabil-
ity" is a statistical indicator of the accuracy of those
estimates: tests with higher reliability are, by definition,
more accurate instruments. For example, if a test has a
reliability coefficient of 0.85, this means that 85 percent of
the variance in scores depends on true differences and 15
percent is attributable to other factors.

Scores therefore need to be accompanied with informa-
tion about the test's reliability. Suppose, for example,
students took a test of arithmetic proficiency with high
reliability, e.g., 0.95. As shown in figure 6-E1, the range of
error around scores on this test is relatively narrow: a score
of 100 reflects a proficiency level of somewhere between
93 and 107. On a test with very low reliability, e.g., 0.40,
the proficiency of a student who scores 100 may be
anywhere from 77 to 123.

This information is particularly important when test
scores are the basis of decisions about students. The
likelihood of incorrect decisions increases when a test's
reliability is low: e.g., students could be denied remedial
services based on an erroneously high score or retained in
a special program because of erroneously low scores.

SOURCE: Office of Technology Assessment, 1992.

Figure 6-E1: Error Ranges on Tests of
Varying Reliability

SOURCE: Office of Technology Assessment, 1992.



which estimate error, allow one to set a range of
likely variation or "uncertainty" around that esti-
mated score. Box 6-E illustrates how great the
variation around a score can get as the reliability of
a test decreases. Interpretation of individual scores
should always take into account this variability.
Small differences between the test scores of individ-
ual students are often meaningless, once error
estimates are considered. When test scores are used
for classification of people, errors will be greatest for
those whose scores are at or near the cutoff point.
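
The error ranges quoted in box 6-E can be reproduced with the standard error of measurement from classical test theory, SEM = SD x sqrt(1 - reliability). The sketch below rests on two assumptions that the report does not state: that the score scale has a standard deviation of 15, and that the range shown is the score plus or minus two standard errors. Those values were chosen here because they match the figures in the box; the function names are hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM for a scale with standard deviation sd."""
    return sd * sqrt(1 - reliability)


def score_band(score, sd, reliability, n_sem=2):
    """Range of likely variation around an observed score.

    n_sem=2 (an assumption of this sketch) gives a band of
    plus or minus two standard errors of measurement.
    """
    sem = standard_error_of_measurement(sd, reliability)
    return score - n_sem * sem, score + n_sem * sem


# Assumed scale: SD = 15. Rounded, these bands match box 6-E.
high_reliability_band = score_band(100, 15, 0.95)  # roughly 93 to 107
low_reliability_band = score_band(100, 15, 0.40)   # roughly 77 to 123
```

Two students scoring 100 and 104 on the high-reliability test have overlapping bands, which illustrates why small score differences are often not meaningful and why decisions near a cutoff score are the most error-prone.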

This suggests two important implications for the 
interpretation of individual scores in educational 
settings: 1) if a test score is used to make decisions 
about individual students, a very high standard of 
reliability is necessary, and 2) using test scores
alone to make decisions about individuals is likely 
to result in higher rates of misclassification or 



incorrect decisions. With respect to educational 
decisions about individuals, test scores should 
always be used in combination with other sources of 
information about the child's behavior, progress,
and achievement levels. 

Validity Evidence for Tests

"It is a useful oversimplification to think of
validity as truthfulness: Does the test measure what
it purports to measure? . . . Validity can best be
defined as the extent to which certain inferences can
be made from test scores." Validity is judged on
a wide array of evidence and is directly related to the
purposes of the test.

Every test needs a clear specification of what it is
supposed to be assessing. So, for example, for a test
of reading proficiency, test designers first need to 



Reliability coefficients are based on the degree of relationship between two sets of scores. Correlation coefficients, generally signified with an "r,"
range from 0.00, indicating a complete absence of relation, to +1.00 and -1.00, indicating a perfect positive or negative relationship. The closer a reliability
coefficient is to +1.00, the better.

Nitko, op. cit., footnote 9, p. 405.

John Salvia and James E. Ysseldyke, Assessment in Special and Remedial Education (Boston, MA: Houghton Mifflin Co., 1985), p. 127.
Mehrens and Lehmann, op. cit., footnote 9, p. 288.
Photo credit: American Guidance Services

Some standardized tests, such as those used in special
education evaluations, are individually administered
by a trained examiner.

specify clearly what is meant by reading proficiency. 
Similarly, a test of diving skill needs to make clear 
what proficient dives look like. Before any testing 
can be done, a clear definition of the skills and 
competencies covered by the test must be made.
There must be a definition of what the skill of 
interest looks like before anyone can decide how to 
test it. Once a method and a metric for assessing the 
skill has been chosen, validity evidence is gathered 
to support or refute the definition and the method 
chosen. 

EXAMPLE: A geometry teacher, who knows 
nothing about diving, is drafted to take over as coach 
of a high school diving team when the regular coach 
is taken ill. While watching the varsity and the junior 
varsity (JV) teams practice, he tries to develop his 
own definition of a skilled dive; noticing that highly 
ranked divers enter the pool with only a slight splash 
while JV team members tend to make lots of waves, 
he designs a 1-10 rating scale to measure diving 



proficiency by judging the height of the splash as the
diver enters the pool. While his criterion for
measuring skill may be related to "true diving
skill," it is not valid as the primary indicator of
diving skill (as will be proven when he attempts to
send his divers into statewide competition). In this
case he has failed to define the trait of interest
(diving skill) but rather jumped ahead to find an
easy-to-measure indicator/correlate of diving skill.
To carry this example farther, as the practice dives
are rated on this scale, his divers begin to modify 
their dives in the attempt to increase their scores so 
that they might go to the State competition. They 
develop inventive ways to enter the water so that 
splashing is minimized. Slowly, their relative ranks 
(to each other) change and some JV members move 
up onto the varsity team. Finally, the best eight 
divers (judged on the 1-10 splash scale) are sent to 
statewide competition. Their scores are the lowest of 
any team and their awkward, gyrating dives send the 
spectators into an uproar. The most "truly" skilled
divers from the team, who stayed home, never had
a chance to compete.

This example illustrates what can happen when an 
invalid measure is used. Often it is hard to define 
excellence or competence, and far easier to choose an
easy-to-measure and readily available indicator of it.
While many of these easy-to-measure characteristics
may be correlated with excellence, they do not
represent the universe of characteristics that define 
competence in the skill of interest. What can happen 
(as in this case) is that students practice to gain more 
skill in the measurable characteristic, often to the 
exclusion of other equally valid — but less readily 
measured — aspects of the skills. In this example, the 
coach should have first developed a definition of a 
skilled dive. Since statewide competition is a goal, 
he would do well to adopt the consensus definition 
and rating scale that is used by judges in the 
competition. This scale has developed validity over 
many years of use through a process of diving 
experts defining and describing: first, what skill in 
diving is and second, what level of skill one needs to 
get each score on a scale of 1 to 10. 

The most often cited form of validity needed for 
achievement tests is called content validity. Estab- 
lishing content validity is necessary in order to 
generalize from a sample to a whole domain; for
example, a sample of science questions is used to 



Office of Technology Assessment, 1992.




generalize about overall science achievement. Does 
the content sampled by the test adequately represent 
the whole domain to which the test is intended to 
generalize? The tasks and knowledge included on a 
test of writing proficiency, for example, should 
represent the whole domain of skills and knowledge 
that educators believe to be important in defining 
writing proficiency. Since the whole domain can 
never be described definitively, the assessment of 
content validity rests largely on the judgment of 
experts. First the domain must be defined, then the 
test constructed to provide a representative sample 
across the domain.

There is no commonly used statistic or numerical 
value to express content validity. The traditional 
process for providing content-related validity evi- 
dence is a multifaceted one that includes review of 
textbooks and instructional materials, judgments of 
curriculum experts, and analysis of vocabulary. In
addition, professionals from varying cultural and
ethnic backgrounds are asked to review test content 
for appropriateness and fairness. The selection of 
test items is also influenced by studies of student 
errors, item characteristics, and evidence of differen- 
tial performance by gender and racial-ethnic groups. 

The content validity of an achievement test finally 
rests, however, on the match between the test 
content and the local curriculum. Thus a school 
system selecting a test must pay careful attention to 
the extent to which test learning outcomes match the 
desired learning outcomes of the school system. "A
published test may provide more valid results for
one school program than for another. It all depends
on how closely the set of test tasks matches the
achievement to be measured."

Another kind of validity evidence, called criterion- 
related, concerns the extent to which information 
from a test score generalizes to how well a person 



will do on a different task. In this case, validity is 
established by examining the test's relation with 
another criterion of importance. For example, the 
Scholastic Aptitude Test (SAT), which is used to
help make decisions about college admissions, is
designed to predict a specific criterion, i.e., freshman
grade point average (GPA). One kind of validity
evidence required for any selection test is a demon-
strated relation to the outcomes being predicted.
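
A demonstrated relation between a test and its criterion is commonly summarized as a correlation (often called a validity coefficient) together with a prediction line. The sketch below is illustrative only: the function name and the idea of fitting an ordinary least-squares line are assumptions made here, and no real admissions data are implied.

```python
from math import sqrt
from statistics import mean


def criterion_relation(test_scores, criterion_values):
    """Return (slope, intercept, r) relating test scores to a criterion.

    slope and intercept define the least-squares line
        predicted criterion = intercept + slope * test score
    and r is the Pearson correlation between the two lists.
    """
    mx, my = mean(test_scores), mean(criterion_values)
    sxy = sum((x - mx) * (y - my) for x, y in zip(test_scores, criterion_values))
    sxx = sum((x - mx) ** 2 for x in test_scores)
    syy = sum((y - my) ** 2 for y in criterion_values)
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r
```

With admission test scores in one list and the same students' later freshman GPAs in the other, a value of r near zero would mean the test says little about the criterion, while a larger r would count as criterion-related evidence for that particular use of the test.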

A third kind of validity evidence, construct-
related, has to do with providing evidence that the
test actually measures the trait or skill it attempts to
measure. Is a test of science achievement actually
measuring knowledge of science and not some other
skill such as reading achievement? Do scores on a
mathematics achievement test really reflect the
amount of mathematics a child has learned in school
and not some other characteristic such as ability to 
work quickly under time pressure? Evidence for 
construct validity is gathered in multiple ways. 

One common form of construct validity for
achievement tests relates to whether or not perform-
ance on the test is affected by instruction. Since an
achievement test is, by definition, intended to gauge
the effects of a specific form of instruction,
scores should increase as a result of instruction. As
the kinds of tests and tasks required of children on
tests change, it will be important to conduct validity
studies to make sure tests are sensitive to instruction.
Care needs to be taken to assure that new tests
designed to assess thinking skills or complex
reasoning actually do assess the skills that can be
taught in classrooms and learned by students.

Evidence that tests of specific skills such as 
reading comprehension, spelling, and vocabulary
are actually assessing the skills they are designed to 
measure is particularly important if those scores are 
going to be used to diagnose a child's strengths and 



Norman E. Gronlund and Robert L. Linn, Measurement and Evaluation in Teaching, 6th ed. (New York, NY: MacMillan Publishing Co., 1990),
p. 55.

The Scholastic Aptitude Test (SAT) is not considered an achievement test, but rather a test of "developed abilities" which consist of ". . . broadly
applicable intellectual skills and knowledge that develop over time through the individual's experiences both in and out of school." (Anastasi, op. cit.,
footnote 4, p. 330.) The SAT is not intended to serve as a substitute for high school grades in the prediction of college achievement; in fact, high school
grades predict college grades as well, or slightly better than does the SAT. However, when test scores are combined with high school grades, prediction
of college grades is enhanced slightly. This "third view" of college-bound candidates (supplementing grades and personal information from applications,
interviews, and reference letters) was seen originally as a way to offset potential inequities of the traditional system; see also James Crouse and Dale
Trusheim, "The Case Against the SAT," Ability Testing: Uses, Consequences, and Controversies, part I, Alexandra K. Wigdor and Wendell R. Garner
(eds.) (Washington, DC: National Academy Press, 1982).

The subtests that typically appear on survey achievement batteries include vocabulary, word recognition skills, reading comprehension, language
mechanics (e.g., capitalization and punctuation), language usage, mathematics problem solving, mathematics computation, mathematics concepts,
spelling, language, science, social studies, research skills, and reference materials.
weaknesses. Similarly, scores designed to assess
"higher order thinking" need validity evidence to
support the assumption that they are capturing
something distinctly different from other scores
assumed to include only "basic skills." These other
forms of construct validity have often been ne-
glected by developers of standardized achievement
tests. Results of a recent survey of the technical
characteristics of 37 published educational achieve-
ment tests indicate that while 73 percent of the tests
presented information about content validity, only
14 percent presented criterion-related validity, and
11 percent construct validity evidence.

Sometimes the argument is made that if a test 
resembles the construct or skill of interest, then it is 
valid. This is commonly referred to as face validity 
because the test looks like the construct it is
supposed to be assessing. Because, for example, a
test item seems to require complex reasoning, it is
assumed to be an indicator of such reasoning.
However, face validity is very impressionistic and is
not considered a sufficient kind of evidence for
serious assessment purposes.

The kinds of evidence discussed above constitute 
empirical or evidential bases for evaluating the 
validity of a test. Recently, however, some investi- 
gators have drawn attention to the importance of 
considering the consequential basis for evaluating 
the validity of test use. The questions posed by this 
form of validity are ethical and relate to the 
justification of the proposed use in terms of social 
values: ". . . should the test be used for the proposed
purpose in the proposed way?"



For example: 

. . . tests used in the schools ought to encourage
sound distribution of instructional and study time. . . .
The worth of an instructional test lies in its contribu-
tion to the learning of students working up to the test
or to next year's quality of instruction. . . . The
bottom line is that validators have an obligation to
review whether a practice has appropriate conse-
quences for individuals and institutions, and espe-
cially to guard against adverse consequences.

How are Achievement Tests Used?

A precise description of how schools actually
use achievement tests is difficult to obtain. Although
there are many testing requirements imposed on
children on their journey through elementary and
secondary schools, it is difficult to say with any
certainty how results are actually used, or by whom.
Once a test is needed for a specific purpose such as
determining eligibility for a compensatory education
program, cost and time constraints often dictate
that the test information is used for other purposes as
well. In addition, the results of a test administration,
once received by a school, are available to many
people unfamiliar with the specific test administered.
Test scores often remain part of a child's
permanent record and it is unclear how they might be
used, and by whom, at some future point. It is
difficult to prevent use of the test information for
other purposes once it has been collected.

The multiple uses of achievement tests in school 
systems can be broadly grouped into three major 
categories.^25 (See table 6-1 for a summary of these
functions.) 



^19 James L. Wardrop, "Review of the California Achievement Tests, Forms E and F," Tenth Mental Measurements Yearbook, Jane Close Conoley
and Jack J. Kramer (eds.) (Lincoln, NB: The University of Nebraska Press, 1989), p. 131.

^20 Bruce Hall, "Survey of the Technical Characteristics of Published Educational Achievement Tests," Educational Measurement: Issues and
Practice, spring 1985, pp. 6-14.

^21 Mehrens and Lehmann, op. cit., footnote 9; Roger Farr and Beverly Farr, Integrated Assessment System: Language Arts Performance Assessment,
Reading/Writing, technical report (San Antonio, TX: The Psychological Corp., 1991); Anastasi, op. cit., footnote 4.

^22 Samuel Messick, "Test Validity and the Ethics of Assessment," American Psychologist, vol. 35, No. 11, 1980, pp. 1012-1027. See also Samuel
Messick, "Validity," Educational Measurement, 3rd ed., Robert Linn (ed.) (New York, NY: MacMillan Publishing Co., 1989).

^23 Lee J. Cronbach, "Five Perspectives on the Validity Argument," Test Validity, Howard Wainer and Henry I. Braun (eds.) (Hillsdale, NJ: Lawrence
Erlbaum, 1988), pp. 5-6.

^24 This discussion of purposes draws on Jason Millman and Jennifer Greene, "The Specification and Development of Tests of Achievement and
Ability," in Linn (ed.), op. cit., footnote 22, pp. 335-367; C.V. Bunderson, J.B. Olsen, and A. Greenberg, "Computers in Educational Assessment," OTA
contractor report, Dec. 21, 1990; J.A. Frechtling, "Administrative Uses of School Testing Programs," in Linn (ed.), op. cit., footnote 22, pp. 475-485;
and R. Darrell Bock and Robert J. Mislevy, "Comprehensive Educational Assessment for the States: The Duplex Design," CRESST Evaluation
Comment, November 1987.

^25 Although many authors have discussed these three major categories, these distinctions are drawn most directly from Lauren B. Resnick and Daniel
P. Resnick, "Assessing the Thinking Curriculum: New Tools for Educational Reform," Future Assessments: Changing Views of Aptitude, Achievement,
and Instruction, B.R. Gifford and M.C. O'Connor (eds.) (Boston, MA: Kluwer Academic Publishers, 1989).
Table 6-1. Three Major Functions of Educational Tests

1. Classroom instructional guidance

   Function: Used to monitor and provide feedback about the progress of
   each student and to inform teaching decisions about individuals on a
   day-to-day basis.

   Examples:
   • Diagnose each student's strengths and weaknesses
   • Monitor the effects of a lesson or unit of study
   • Monitor mastery and understanding of new material
   • Motivate and organize students' study time
   • Adapt curriculum to progress as indicated by tests
   • Monitor progress toward curricular goals
   • Plan lessons that build on students' level of current understanding
   • Assign students to learning groups (e.g., reading group)

2. System monitoring

   Function: Used for monitoring and making administrative decisions about
   aggregated groups of students (e.g., a school, instructional programs,
   curricula, district).

   Examples:
   • Report to parents and school board about a school or district's
     performance
   • Make decisions about instructional programs and curriculum changes
   • Evaluate Chapter 1 programs
   • Evaluate experimental or innovative programs
   • Allocate funds
   • Evaluate teacher performance/school effectiveness
   • Provide general information about performance of the overall
     educational system

3. Selection, placement, and certification of students ("gatekeeping")

   Function: Used to allocate educational resources and opportunities
   among individuals.

   Examples:
   Selection:
   • Admission to college or private schools
   Placement:
   • Place students in remedial programs (e.g., Chapter 1)
   • Place students in gifted and talented programs
   Certification:
   • Certify minimum competency for receipt of high school diploma
   • Certify mastery of a course of study (e.g., Advanced Placement
     examinations)
   • Make decisions about grade promotion

SOURCE: Office of Technology Assessment, 1992.



The first broad category encompasses the kind of 
tests that can support and guide the learning process 
of each individual student in the classroom. These 
tests can be used to monitor and provide feedback 
about the educational progress of each student in the 
classroom, to diagnose areas of strength and weakness,
and to inform teacher decisions about how and
what to teach based on how well students are 
learning the material. 

The second major function, system monitoring,
encompasses the many managerial uses of tests to
monitor the educational system and report to the
public. In these uses, what is needed is aggregated
information about the achievement of groups of
students, from classrooms to schools, from districts
to States. School administrators use these data to
make decisions among competing curricula or instructional
programs and to report to the public
about student achievement. In addition, test scores
are increasingly being used as accountability tools to
judge the quality of the educational system and those
who work for it. Tests used as accountability tools
are often intended to allow a public evaluation of
whether or not standards are being met.^26

The third broad category of uses is also manage- 
rial, called here selection, placement, and certifica- 
tion. Included in this broad category are tests used to 
make institutional decisions affecting the progress
of individual students through the educational sys- 
tem. Comparable information is needed for each 

^26 Frechtling, op. cit., footnote 24.

Table 6-2. Consumers and Uses of Standardized Test Information

Consumer                                                      Unit of analysis

National level
  Allocation of resources to programs and priorities          Nation, State
  Federal program evaluation (e.g., Chapter 1)                State, program

State legislature/State department of education
  Evaluate State's status and progress relevant to standards  State
  State program evaluation                                    State, program
  Allocation of resources                                     District, school

Public (lay persons, press, school board members, parents)
  Evaluate State's status and progress relevant to standards  District
  Diagnose achievement deficits                               Individual, school
  Develop expectations for future success in school           Individual

School districts: central administrators
  Evaluate districts                                          District
  Evaluate schools                                            Schools
  Evaluate teachers                                           Classroom
  Evaluate curriculum                                         District
  Evaluate instructional programs                             Program
  Determine areas for revision of curriculum and instruction  District

School districts: building administrators
  Evaluate school                                             School
  Evaluate teacher                                            Classroom
  Group students for instruction                              Individual
  Place students into special programs                        Individual

School districts: teachers
  Group students for instruction                              Individual
  Evaluate and plan curriculum                                Classroom
  Evaluate and plan instruction                               Classroom
  Evaluate teaching                                           Classroom
  Diagnose achievement deficits                               Classroom, individual
  Promotion and graduation                                    Individual
  Place into special programs (e.g., gifted, handicapped)     Individual

Educational laboratories, centers, universities
  Policy analysis                                             All units
  Evaluation studies                                          All units
  Other applied research                                      All units
  Basic research                                              All units

SOURCE: Thomas M. Haladyna, Susan Bobbitt Nolen, and Nancy S. Haas, "Raising Standardized Achievement Test
Scores and the Origins of Test Score Pollution," Educational Researcher, vol. 20, No. 5, June-July 1991,
p. 3.



individual student so that managerial decisions can
be made about the allocation of additional resources,
placement in instructional programs, and certification
of mastery. Increasingly, test scores have been
used to make such decisions because they are
perceived to provide clear, objective criteria. Thus,
eligibility for a compensatory education program
(e.g., Chapter 1) might be determined by a district
policy that states a cutoff score below which
children must score to qualify. Qualifying for an
enrichment program might be contingent on scoring
above some designated level on a standardized test.
The results of these tests clearly have significant
implications for a student's progress through the
school system.^27
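
The cutoff-score logic described above can be illustrated with a minimal sketch in Python. The names (PlacementPolicy, place_student) and the cutoff values are hypothetical and are not drawn from any district's actual policy; the sketch assumes one score per student on a common scale and a policy that looks at that score alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacementPolicy:
    """Hypothetical district policy expressed as two cutoff scores."""
    compensatory_cutoff: float  # scores below this qualify for compensatory services
    enrichment_cutoff: float    # scores above this qualify for enrichment


def place_student(score: float, policy: PlacementPolicy) -> str:
    """Return the program a single test score would assign under the policy."""
    if score < policy.compensatory_cutoff:
        return "compensatory"
    if score > policy.enrichment_cutoff:
        return "enrichment"
    return "regular"


# Illustrative values only (a percentile-like scale is assumed).
policy = PlacementPolicy(compensatory_cutoff=30, enrichment_cutoff=90)
placements = [place_student(s, policy) for s in (29, 30, 90, 91)]
```

In this sketch a one-point difference at either cutoff (29 versus 30, or 90 versus 91) changes the placement, which is why the accuracy of scores near the cutoffs matters so much for decisions of this kind.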

Consumers of Achievement Tests 

In addition to the many uses for achievement 
test-based information, there are many different 
consumers or users who need that information. The 
kind of information needed is often very different
depending on who wants it. Table 6-2 summarizes
the major consumers of test-based information as 



^27 Ironically, while most of the supplementary resources allocated by schools are likely to be targeted to children scoring either quite low or quite high
on these tests, the norm-referenced achievement tests routinely used by most school districts are designed to measure most accurately in the middle of
the achievement distribution rather than at either the highest or the lowest ends.
well as the most common uses of each consumer.^28
Within the educational system there are multiple
levels of need for test-based information including
Federal, State, district, school, and classroom information.
Policymakers and legislators need the information,
as well as education departments. Teachers,
parents, students, and the public also require
test-based information about achievement.

Mandatory schoolwide testing programs, in which 
each child in a given grade takes the same test, have
become routine. Some tests are required at the
Federal level, e.g., for Chapter 1 accountability,^29
some mandated by the States, and others imple- 
mented by local school districts. Because most 
school districts want to keep testing requirements to 
a minimum, a test is often chosen that can serve as 
many uses and consumers as possible. 

Figure 6-3 illustrates the mandated schoolwide 
tests given in grades 1 through 12 for three large 
school districts. State-mandated testing requirements,
which have increased in overall numbers in
recent years, account for only a fraction of the total 
testing burden. Additional tests (not listed in the 
table) are also administered to some subgroups of 
children who need to be screened for special 
services. For example, although some districts may 
use schoolwide tests to satisfy Federal-level Chapter 
1 accountability requirements (Philadelphia uses the 
City Wide Test for this purpose), many children who
receive Chapter 1 services will take tests in addition 
to those listed in the table. 

Although the specifics of who actually uses test 
results and for what purposes remain difficult to 
document, evidence suggests that requirements re- 
garding standardized achievement tests are imposed 
largely to serve the two broad managerial purposes:
system monitoring; and selection, placement, and
certification. There are few standardized tests designed
explicitly to help teachers assess ongoing
classroom learning and inform classroom practice.
Furthermore, evidence also suggests that teachers
find the results of existing standardized achievement
tests only generally useful for classroom practice. In

Teachers need tests that are closely matched to
instruction and that provide detailed information about
student progress on a frequent basis. This kind of
information, which can help teachers influence learning
and guide instruction, is very different from the kind of
information school administrators need to
monitor school systems.

one study that interviewed teachers, 61 percent 
reported that standardized tests have little effect on
their instructional decisionmaking.^30

Current achievement tests do a good job of assessing
a student's general level of knowledge in a particular
content domain. . . . A low score relative to a
student's grade placement on, say, a reading comprehension
test is apt to be a valid indicator that a
student will have difficulty reading and understanding
assignments in the typical textbooks used at the
grade level. Such global information, however, is
more likely to confirm what the teachers already
know about the student than to provide them with
new insights or clear indications of how best to help
the student. The global score simply does not reveal
anything about the causes of the problem or provide
any direct indications of what instructional strategies
would be most effective.^31



^28 See also Bock and Mislevy, op. cit., footnote 24, for a similar list and analysis of test consumers.

^29 Chapter 1 is a Federal compensatory education program serving low-achieving students from low-income schools. See ch. 3 for a fuller discussion
of the testing and evaluation requirements under Chapter 1.

^30 Robert B. Ruddell, "Knowledge and Attitudes Toward Testing: Field Educators and Legislators," The Reading Teacher, vol. 38, 1985, pp.
538-543.

^31 Robert L. Linn, "Barriers to New Test Designs," The Redesign of Testing for the 21st Century (Princeton, NJ: Educational Testing Service, Oct.
26, 1985), p. 72.
Figure 6-3. Testing Requirements: Three District Examples

A child going to school in these districts would take each test listed.

[Figure: three panels, one per district, each showing the mandated schoolwide tests by grade, grades 1 through 12.]

Tests named in each panel:

Philadelphia, PA: PMET, CWT, TELLS
Springfield, MO: FGRMT, MMAT, DAT
Milwaukee, WI: MPS ORT, ITBS, DPI RT, Comp M, Comp W, Comp R, Comp L, TAP, DAT, P-ACT+

KEY:
Comp L    Competency language
Comp M    Competency mathematics
Comp R    Competency reading
Comp W    Competency writing
CWT       Philadelphia City Wide Test
DAT       Differential Aptitude Test
DPI RT    DPI Reading Test
FGRMT     First Grade Reading and Math Test
ITBS      Iowa Tests of Basic Skills
MMAT      Missouri Mastery and Achievement Test
MPS ORT   Milwaukee Public Schools ORT Language Test
PMET      Philadelphia Mathematical Evaluation Test
TAP       Test of Achievement and Proficiency
TELLS     Test of Essential Learning and Literacy Skills (PA State test)

NOTE: If students have special needs or are in supplementary programs (e.g., Chapter 1 or gifted programs) they will usually take additional tests.

SOURCES: Milwaukee Public Schools, "Summary Report and Recommendations of the Assessment Task Force," unpublished report, June 2, 1969; Springfield Public Schools, 1990; Nancy Kober,
"The Federal Framework for Evaluation and Assessment in Chapter 1, ESEA," OTA contractor report, May 1991.

Teachers desire diagnostic tests that are precise,
closely matched to curricula and instruction, and
timely. Achievement tests of the kind now widely
used do not match these criteria.^32

Part of the reason that few existing standardized
tests are applicable for classroom use, however, has
to do with local control of curriculum. Achievement
tests are designed to match the goals and objectives
of the content being taught; the validity of an
achievement test rests largely on the degree to which
it mirrors the content being taught in the classroom.
A test that contains a great deal of content not
covered by the curriculum in a particular school is
said to be "content invalid" for that school.
Teachers, because they know what they are teaching,
can design tests that are well aligned with the
curriculum. If an examination is designed at a great
distance from the local classroom (as commercially
produced and published tests are bound to be) it is
less likely to reflect the specific curricular content of
the classroom; these tests will largely reflect only
those broad content areas and skills that are common
across school settings and on which there is implicit
consensus.^33 Thus, tests that are precise and closely
matched to curricula, and therefore useful to teachers,
will need to be designed at the local level, close
to where specific curricular goals and objectives are
set. "Generic" standardized achievement tests as
currently designed cannot be both specific enough to
assist teachers on an ongoing basis and generic
enough to be useful to large numbers of school
systems.

Most mandated, standardized testing is put in
place for managerial purposes and not for purposes
related to directly shaping day-to-day learning
processes in classrooms. Since such tests are generally
given once a year, they can offer teachers a
"snapshot" of a child's achievement at one particular
point in time, but offer little information about
the ongoing, ever-changing process of a child's
learning and development.^34

The social success of testing in many ways is a
product of the bureaucratization of education. Testing
seems not so important in the stuff of teaching
and learning, where surely there must be much
personal contact, but rather in the interstices of our
educational institutions -- entry into elementary
school, placement in special classes, the transition
from elementary to secondary school, high school
leaving and college going.^35



Test Misuse 

It is difficult to make general statements about the
misuses of tests, because each test has to be
evaluated with respect to its own specifications and
technical evidence regarding the validity of its use
for specific purposes.^36 Many different tests are used
by school systems, some commercially designed,
some designed by districts or States. However,
results of one survey of mathematics teachers shed
some light on the uses of well-known commercial
achievement tests. In this survey, three commercial
tests were found to account for 44 percent of district
testing requirements. In districts where these three
tests were used, about two-thirds of the teachers
reported their use by the district to group students by
ability and to assign students to special programs.
However, technical reviews of these three tests have
suggested that evidence is lacking regarding inferences
about student diagnosis and placement for
these tests.^37 One reviewer cautioned about one of
these tests that: ". . . although useful as an indicator
of general performance, the usefulness of the test for
diagnosis, placement, remediation or instructional
planning has not been validated."^38



^32 Leslie Salmon-Cox, "Teachers and Standardized Achievement Tests: What's Really Happening?" Phi Delta Kappan, vol. 62, No. 9, 1981, p. 634.

^33 See, e.g., Roger Farr and Robert F. Carey, Reading: What Can Be Measured? 2nd ed. (Newark, DE: International Reading Association, Inc., 1986),
p. 149.

^34 The majority of districts test at the end of the school year and the results are often received too late to be of help to that year's classroom teacher.
Some districts test more than once a year.

^35 Walter Haney, "Testing Reasoning and Reasoning About Testing," Review of Educational Research, vol. 54, No. 4, 1984, p. 641.

^36 See also Robert L. Linn, Center for Research on Evaluation, Standards and Student Testing, University of Colorado at Boulder, "Test Misuse: Why
Is It So Prevalent?" OTA contractor report, September 1991; Larry Cuban, Stanford University, "The Misuse of Tests in Education," OTA contractor
report, Sept. 9, 1991; and Nelson Noggle, "The Misuse of Educational Achievement Tests for Grades K-12: A Perspective," OTA contractor report,
October 1991.

^37 T. Romberg, E.A. Zarinnia, and S.R. Williams, The Influence of Mandated Testing on Mathematics Instruction: Grade 8 Teachers' Perceptions
(Madison, WI: National Center for Research in Mathematical Sciences Education, March 1989).

^38 Peter W. Airasian, "Review of the California Achievement Tests, Forms E and F," Jane Close Conoley and Jack J. Kramer (eds.), The Tenth Mental
Measurements Yearbook (Lincoln, NB: The University of Nebraska Press, 1989), pp. 719-720.
Although most standardized achievement tests
are not designed to be used as selection or placement
instruments on which to base judgments about future
proficiency or capability, there are few mechanisms
to prevent such uses. Tests that are going to be used
for selection should be designed and validated for
that purpose. Tests designed to be used as feedback
mechanisms to inform the learning process
should not be used to make significant decisions
about an individual's educational career unless
additional evidence can be provided substantiating
this use. However, there are few safeguards
available to make sure this does not happen.

One of the most consistent recommendations of
testing experts is that a test score should never be
used as the single criterion on which to make
decisions about individuals. Significant legal chal- 
lenges to the over-reliance on IQ test scores in 
special education placements led to an exemplary 
federally mandated policy on test use in special 
education decisions. In Public Law 94-142, Con- 
gress included several provisions designed to protect 
students and ensure fair, equitable, and non- 
discriminatory assessment procedures. Among these 
were: 

• decisions about students are to be based on 
more than performance on a single test,

• tests must be validated for the purpose for 
which they are used, 

• children must be assessed in all areas related to 
a specific or suspected disability, and 

• evaluations should be made by a multidisciplinary
team.^39

This legislation provides, then, a number of signifi- 
cant safeguards against the simplistic or capricious 
use of test scores in making educational decisions. 
Similar safeguards are needed to prevent over-reliance
on single test scores to make educational
decisions about all students, not just those in special
education programs.^40

Other examples of test misuse arise when results 
of available tests are used in the aggregate to make 
unsupportable inferences about educational effectiveness.
The use of college admissions tests (SAT
and the American College Testing program, ACT)



Photo credit: Educational Testing Service

Some standardized tests are used to make significant
decisions about the progress of individual students through
the educational system. These tests must meet very
high technical standards and are most subject to
scrutiny and legal challenge.

to compare the quality of education in various States,
as in the "Wall Charts" produced by the U.S.
Department of Education, is one prominent example.
The SAT is taken by different numbers and
samples of students (none of them randomly selected)
in each State. Further, inferences about the
achievement levels of high school seniors should be
made only from a test designed to sample what high
school seniors have been taught. The SAT is not
designed for this purpose; it is designed to predict
success (grade point average) in the freshman year of
college. College admissions tests are designed for a
distinctly different purpose than informing policymakers
interested in educational quality.^41 In some
respects it is similar to using a test of reading
achievement to draw conclusions about mathematics
achievement; although the two are likely to show
some relation to one another, it would be erroneous
to draw conclusions and make decisions about
mathematics based on test scores in reading.

Changing Needs and Uses for 
Standardized Tests 

Current disaffection with the widely used existing 
standardized tests rests largely on three features of 
those tests: 1) most are norm-referenced and thus 






^39 Salvia and Ysseldyke, op. cit., footnote 12.

^40 See ch. 2 for further discussion of test misuse and mechanisms for enforcing appropriate testing practices.

^41 See Robert L. Linn, "Accountability: The Comparison of Educational Systems and the Quality of Test Results," Educational Policy, vol. 1, No.
2, 1987, pp. 181-198, for further discussion of the problems involved in using test scores to compare educational quality across States.
scores are based on comparing students to one
another; 2) most are exclusively made up of multiple-choice
items; and 3) their content does not adequately
represent local curricula, especially those
parts associated with thinking and reasoning skills.
Most of the new developments in test design and 
alternative forms of assessment reflect a move away 
from this one dominant testing technology. What 
features do innovators seek in other designs? 

What Should the Yardstick Be? 

Traditional test theory and techniques of test construction
have been developed on the assumption
that the purpose of a test is to discriminate among
individuals. If the purpose of a test is to compare
each individual to a standard, then it is irrelevant
whether or not the individuals differ from each
other.^42

Recent attempts to develop alternative tests represent
a move away from the traditional testing model
built on comparing individuals to one another.
Instead, new testing developments represent attempts
to extend the criterion-referenced model of
testing and design ways to assess students against
criteria and goals for achievement.

There are two main reasons that existing norm- 
referenced tests tend to provide broad coverage of a 
limited number of content areas. First, these tests are 
designed to be taken by students of all skill levels in 
a given grade; this means that for every content area 
covered by the test, many items must be administered,
ranging from low to high levels of difficulty.
Most students will spend scarce testing time answering
extra items (some too difficult, some too
easy) included in order to accommodate all test
takers. This means that fewer content areas can be
covered in a limited amount of testing time. Second, 
NRTs must concentrate on those content areas that 
are common to most schools throughout the country. 
In essence, the content areas represented on NRTs 
represent broad and generally implicit national 
consensus about the core skills that children should 
know at each grade level. If these tests are primarily 
tests of basic skills, as many have argued, it may be 
because it is these skills that are common to the 



majority of curriculum frameworks throughout the 
country. Because of the way NRTs are developed, 
the content areas included can only represent a 
subset of the content areas covered in any particular 
school. Arizona, for example, found that only 26 
percent of their curriculum goals were covered in the 
NRT they had been using. Thus, existing NRTs will 
only assess a limited set of content areas and only in 
a very general way. However, they can provide a 
basis for comparing children across the Nation on 
that common general content.
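
The Arizona figure is a coverage ratio: the share of a curriculum's goals that the test touches at all. A minimal sketch of that arithmetic follows. The goal labels and counts are made up for illustration and do not describe Arizona's curriculum or any actual test; the function name curriculum_coverage is likewise an assumption.

```python
def curriculum_coverage(curriculum_goals, tested_goals):
    """Fraction of local curriculum goals that appear on the test."""
    goals = set(curriculum_goals)
    if not goals:
        raise ValueError("curriculum_goals must not be empty")
    return len(goals & set(tested_goals)) / len(goals)


# Hypothetical local curriculum with eight goals. The test covers two of
# them, plus one topic ("map reading") that this district does not teach.
local_goals = ["decoding", "main idea", "inference", "essay writing",
               "fractions", "estimation", "data display", "proof sketch"]
test_content = ["decoding", "fractions", "map reading"]

coverage = curriculum_coverage(local_goals, test_content)  # 2 of 8 goals
```

Note that the ratio says nothing about how thoroughly a covered goal is tested; it only counts whether the goal appears on the test.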

Comparing children across the Nation on what 
they have been taught, without setting any standards 
or goals as to what they should have been taught, 
entails testing only those skills for which there is an
implicit national consensus, which is also likely to
be the "least common denominator" of academic
content. Local control over curricula means that
each district can decide what skills and knowledge
fourth graders should have, for example. To compare
them fairly, one can only use a test that
represents content all children have been taught.
However, if one is willing to arrive at some kind of
consensus about what children should know at
various age levels, then tests can be designed to
represent those areas.^43

Criterion-referenced tests (CRTs) can provide 
specific information that is directly tied to the 
curricula being delivered in the classroom. Most 
tests need to be developed locally to achieve this 
level of specificity. Many States have, in recent 
years, implemented a CRT statewide program in 
order to assess progress on State-mandated goals and 
skills. However, many people, from policymakers to 
parents, also want a method for referencing how 
students are doing with respect to the education of 
the whole Nation. Parents and policymakers want 
assurance that children are not just getting the set of 
skills and knowledge that would make them successful
in Wyoming, for example, but rather that the
received education is preparing children for the 
national workplace and postsecondary educational 
institutions. Because States and districts continually 
need to evaluate their own goals and curriculum, 
data comparing their students to students across the



^42 Mehrens and Lehmann, op. cit., footnote 9, p. 21.

^43 Another important aspect of the design of norm-referenced tests has to do with the way items are finally selected to appear on the test. "One of the
most important criteria for deciding whether to retain a test item is how well that item contributes to the variability of test scores." Rudner et al. (eds.),
op. cit., footnote 2, p. 12. In this model, items that are too easy or too difficult may be eliminate

learning goals. For example, information that has been mastered by all children of a given age may not appear on the test because this information does
not describe the differences in what they know.
Nation can provide an important perspective on the
relative success of their educational efforts. At the
present time, nationally norm-referenced standardized
achievement tests are the only mechanism
available for achieving this type of "national
calibration."^44 Thus many States and districts will
adopt an overall testing program that uses both an
NRT and a CRT. One testing program (CRT) can
describe how the State is doing with respect to its
own curricular goals; the other (NRT) program can
describe how children in the State are achieving
relative to all children in the country.^45

How Much is Enough? Setting Standards 

It can be difficult to evaluate what either a CRT or
NRT score means without reference to some standard
or decision about how much is enough. If a child
has mastered 70 percent of a given skill, how is she
doing? This score means something different to her
teacher if most other children in her class know 100
percent than if most know 50 percent. It also matters
whether the school district expects 100 percent mastery
of this skill in first grade or fifth grade. Often, therefore,
cutoff scores are set to establish mastery levels.
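
The two readings of the same 70 percent score can be made concrete with a short sketch. The class scores and cutoffs below are hypothetical, and the function names are assumptions introduced only for illustration; the first function gives a norm-referenced reading, the second a criterion-referenced one.

```python
def percent_scoring_below(score, group_scores):
    """Norm-referenced view: percent of the comparison group scoring lower."""
    if not group_scores:
        raise ValueError("group_scores must not be empty")
    below = sum(1 for s in group_scores if s < score)
    return 100.0 * below / len(group_scores)


def meets_cutoff(score, cutoff):
    """Criterion-referenced view: has the student reached the mastery cutoff?"""
    return score >= cutoff


child_score = 70                      # percent of the skill mastered
strong_class = [100, 100, 100, 100]   # hypothetical classmates who know it all
weaker_class = [50, 50, 50, 50]       # hypothetical classmates who know half

rank_in_strong_class = percent_scoring_below(child_score, strong_class)
rank_in_weaker_class = percent_scoring_below(child_score, weaker_class)
passes_strict_cutoff = meets_cutoff(child_score, cutoff=100)
passes_lenient_cutoff = meets_cutoff(child_score, cutoff=60)
```

The same score places the child at the bottom of one class and at the top of the other, and it passes or fails depending on where the mastery cutoff is set. Neither reading is available from the raw score alone.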

In discussions of testing, this represents the more 
technical meaning of the word * * standard/*"^ In this 
case: 

. . . a standard is an answer to the question "How much is enough?" There
are standards for many kinds of things, including the purity of food
products, the effectiveness of fire extinguishers and the cleanliness of
auto exhaust fumes. When you choose a passing score, you are setting a
standard for performance on a test.^47

The most familiar testing example comes from 
minimum competency testing; a passing score is set, 
based on some criteria for competency, above which 
students are certified and below which they are not. 



The answer to "how much is enough?" is almost always "it depends." How safe
is safe enough and how clean is clean enough are issues that have occupied
consumer safety and environmental protection advocates and policymakers for
years. Choosing a passing score on a test is rarely clear-cut. Any standard
is based on some type of judgment. In testing, the choice of a passing
score or scores indicating levels of proficiency will be largely reliant on
judgments. In testing, ". . . it is important that these judgments be:

1. made by persons who are qualified to make them;

2. meaningful to the persons who are making them; and

3. made in a way that takes into account the purpose of the test."^48

Because of the error inherent in any individual test 
score, however, it is virtually impossible to choose 
a passing score that will eliminate mistakes or wrong 
decisions. Some test takers will pass when they 
should have failed and some will fail when they 
should have passed. When setting passing scores or 
standards it is important to consider the relative 
likelihood, importance, and social value of making 
both of these kinds of wrong decisions.^49
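
The trade-off between the two kinds of wrong decisions can be illustrated
numerically. The sketch below is a hypothetical illustration, not a
procedure described in this report. It assumes that an observed score equals
a true score plus normally distributed error with a known standard error of
measurement (SEM); the function names, the true scores, the cut scores, and
the SEM of 4 points are all invented for the example.

```python
import math


def prob_pass(true_score: float, cut_score: float, sem: float) -> float:
    """Probability that the observed score meets the cut score.

    Assumes observed = true + Normal(0, sem) error (illustrative assumption).
    """
    z = (cut_score - true_score) / sem
    # P(observed >= cut) = 1 - Phi(z), where Phi is the standard normal CDF.
    return 0.5 * math.erfc(z / math.sqrt(2))


def error_rates(true_scores, cut_score, sem):
    """Expected share of false passes and false fails in a group."""
    false_pass = 0.0
    false_fail = 0.0
    for t in true_scores:
        p = prob_pass(t, cut_score, sem)
        if t >= cut_score:
            false_fail += 1.0 - p  # should pass, but may fail by chance
        else:
            false_pass += p        # should fail, but may pass by chance
    n = len(true_scores)
    return false_pass / n, false_fail / n


# Hypothetical true scores for eight test takers.
true_scores = [55, 62, 68, 70, 72, 75, 81, 90]
for cut in (65, 70, 75):
    fp, ff = error_rates(true_scores, cut, sem=4.0)
    print(f"cut={cut}: false pass={fp:.3f}, false fail={ff:.3f}")
```

Moving the cut score changes the balance between the two error rates, which
is why the relative cost of each kind of mistake has to be weighed when the
standard is chosen.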

A second, more general use of the term standard 
is also being employed in many of the current 
discussions about testing. 

As the history of the word reminds us, a "standard" is a set of values
around which we rally; we "defend" standards. (The "standard" was the flag
held aloft in battle, used to identify and orient the troops of a
particular king.) . . . Standards represent . . . desirable behaviors, not
the best typical behavior.^50

This meaning of standard draws more from the dictionary definition of a
standard as ". . . something established by authority, custom, or general
consent as a model or example."^51 A standard, in this sense, is an
exemplar: ". . . whether few, many, or all students can meet or choose to
meet it is an independent issue. . . ."^52

^44 See Linn, op. cit., footnote 41, pp. 181-198, for further discussion of various options by which State and national comparisons might be made.

^45 See also the profiles of Arizona and Kentucky State testing programs in ch. 7.

^46 Webster's defines this meaning as ". . . something set up and established by authority as a rule for the measure of quantity, weight, extent, value
or quality." Webster's Ninth New Collegiate Dictionary (Springfield, MA: Merriam-Webster, 1988), p. 1148.

^47 Samuel A. Livingston and Michael J. Zieky, Passing Scores: A Manual for Setting Standards of Performance on Educational and Occupational
Tests (Princeton, NJ: Educational Testing Service, 1982), p. 10.

^48 Ibid., p. 12.

^49 For analysis and discussion of technical problems in the setting of cutoff scores see, e.g., Robert Guion, "Personnel Assessment, Selection, and
Placement," Handbook of Industrial and Organizational Psychology, vol. 2, M. Dunnette and L. Hough (eds.) (Palo Alto, CA: Consulting Psychologists
Press, 1991), pp. 327-397.

^50 Grant Wiggins, "'Standards' Should Mean 'Qualities,' Not Quantities," Education Week, vol. 9, No. 18, Jan. 24, 1990, p. 36.

An example of this kind of standard that is now widely cited is the
Curriculum and Evaluation Standards for School Mathematics prepared by the
National Council of Teachers of Mathematics (NCTM). This document contains
a series of standards intended to be criteria against which schools can
judge their own curricular and evaluation efforts. For example, the first
standard reads as follows:

Standard 1: Mathematics as Problem Solving

In grades K-4, the study of mathematics should emphasize problem solving so
that students can:

• use problem-solving approaches to investigate and understand mathematical
  content;

• formulate problems from everyday and mathematical situations;

• develop and apply strategies to solve a wide variety of problems;

• verify and interpret results with respect to the original problem;

• acquire confidence in using mathematics meaningfully.^53

The NCTM document does not specify how to test or assess this standard, nor
does it say "how much is enough." Instead it provides a common framework
and a set of exemplars toward which educators and students can work; such
standards describe what optimal performance looks like and what is
desirable for students to know. Without clear standards for performance,
many students are left struggling to understand the criteria on which they
are being evaluated. Box 6-F, excerpted from a contemporary play,
highlights one aspiring athlete's struggle to ascertain the criteria or
standards by which his performance as an athlete is being judged. Box 6-G
describes some of the issues involved in setting and maintaining standards.

What Should the Tests Look Like? 

Currently almost all group-administered standardized achievement tests are
made up of multiple-choice items; increasing dissatisfaction with
multiple-choice technology as the single method for assessing achievement
has led to considerable current experimentation with other item types and
testing formats. Although the pros and cons of multiple-choice items are
being widely and hotly debated, this testing format has many valuable
characteristics.

Box 6-F: Helping the Student Understand Expectations: The Need for Clear Criteria

The need for explicit standards and criteria in learning is aptly described
in this letter excerpted from the play Love Letters. The letter is written
by a teenage boy about his performance in crew.

I'm stroking the 4th crew now. Yesterday, I rowed number 2 on the 3rd.
[...] number [...] on the 2nd or number 4 on the 4th. Who knows? You get
out there and work your butt off, and the launch [...] and the next day
they post a list on the bulletin board saying who will row what. They never
tell you what you did right or wrong, whether you're shooting your slide or
bending your back or [...] post the [...] for all to see. Some days I think
I'm doing really well, and I get sent down two crews. One day I was
obviously hacking around, and they moved me UP. There's no rhyme or reason.
I went to Mr. Clark who is the head of rowing and I said, "Look, Mr. Clark.
There's something wrong about this system. People are constantly moving up
and down and no one knows why. It doesn't seem to have anything to do with
whether you're good or bad, strong or weak, coordinated or uncoordinated.
It all seems random, sir." And Mr. Clark said "That's life, Andy." And
walked away. Well maybe that's life, but it doesn't have to be life. You
could easily make rules which made sense, so the good ones moved up and the
bad ones moved down, and people knew what was going on. I'm serious.^1

^1 From Love Letters, a play by A.R. Gurney.

The multiple-choice item has achieved virtual dominance of the large-scale
testing market primarily because of its psychometric and administrative
properties. Although expensive and difficult to develop, multiple-choice
items are efficient to administer and score, particularly when items and
answers are kept secure. Large numbers of students can be tested
simultaneously and their tests scored and returned within a relatively
short period of time.^54 These tests can also be administered without any
special training or equipment. The answers can be objectively scored, thus
seeming to avoid any judgment or subjectivity in scoring and potential
controversy that might result.

^51 Webster's Ninth New Collegiate Dictionary, op. cit., footnote 46.

^52 Wiggins, op. cit., footnote 50, p. 25.

^53 National Council of Teachers of Mathematics, Curriculum and Evaluation Standards for School Mathematics (Reston, VA: 1989), p. 23.

^54 A typical standardized achievement test battery can be scored and reported back to schools in about 6 weeks.

The measurement properties of multiple-choice items also make them very
efficient. Many items can be administered in a relatively short amount of
testing time, providing much information and making composite scores highly
stable and reliable. The large number of items also allows each content
domain assessed to be represented by multiple questions, which increases
both the reliability and validity of the test. Because large numbers of
items can be pretested efficiently, a large pool of good items with
empirical description of their difficulty levels (and other item parameters
of concern in the design of tests) can be developed. Items in this pool can
also be tested for statistical evidence of bias. Finally, multiple-choice
items have been found to perform as well as other, less efficient kinds of
items (e.g., essays) for specific functions such as predicting freshman
college grades.^55 The dominant testing technology of the present
(multiple-choice items in a norm-referenced test) has been shown to be a
very efficient technology for some specific purposes, in particular those
purposes that require ranking individuals along a continuum. However, this
is only one of many educational uses for achievement tests.
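
The link between the number of items and the reliability of a composite
score is commonly expressed with the Spearman-Brown formula. The report does
not cite this formula; it is brought in here only as a standard psychometric
illustration. The sketch assumes that any added items are parallel to the
existing ones, and the starting reliability of 0.60 for a 10-item test is an
invented value.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability of a test lengthened by `length_factor`.

    Assumes the added items are parallel to the original ones
    (the Spearman-Brown prophecy formula).
    """
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


base_reliability = 0.60  # hypothetical reliability of a 10-item test
for factor in (1, 2, 4, 8):
    predicted = spearman_brown(base_reliability, factor)
    print(f"{10 * factor:3d} items: predicted reliability = {predicted:.2f}")
```

Under these assumptions reliability rises with test length but with
diminishing returns, which is one reason a format that allows many short
items in limited testing time yields stable composite scores.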

The educational advantages of multiple-choice items, the ways in which they
enrich or enhance learning, are harder to articulate. Historically,
educational examinations consisted of oral or written questions used to
demonstrate mastery of content taught. Most other industrialized countries
do not use multiple-choice examinations in education.^56 Multiple-choice
items were pressed into service in this country when more efficient methods
of testing large numbers of students were needed (see ch. 4). Each step in
the historical process of examining, from oral to written examinations and
then from written to multiple-choice, has taken us farther away from the
actual skills, such as oral and written expression, that we want children
to develop. Critics of multiple-choice items argue that we spend
considerable time training students in a skill not required in life, namely
answering multiple-choice questions. As one analyst has observed: ". . .
most of the important problems one faces in real life are ill-structured,
as are all the really important social, political, and scientific problems
in the world today. But ill-structured problems are not found in
standardized achievement tests."^57 Many educators are now arguing that
achievement tests need to consist of items and tasks that are more
"authentic," i.e., are made up of skills that we actually want children to
practice and master, such as producing and explaining how they reached the
answer, writing a logical argument, drawing a graph, or designing a
scientific experiment. These efforts are described at length in the next
chapter.

Photo credit: Bob Daemmrich

These elementary school students are taking a multiple-choice achievement
test that requires filling in the correct "bubble" on a separate answer
sheet. Although such tests have certain advantages, many educators believe
that negative effects on classroom practice indicate a need for new testing
approaches.

One of the consistent themes of the debate 
throughout the last 70 years has been to ask whether 
more open-ended items (e.g., essays) really measure 



^55 See, e.g., Brent Bridgeman and Charles Lewis, "Predictive Validity of Advanced Placement Essay and Multiple-Choice Examinations," paper
presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL, April 1991.

^56 A major exception is Japan, which does as much (if not more) multiple-choice testing than does the United States. See ch. 5 for discussion.

^57 Norman Frederiksen, "The Real Test Bias: Influences of Testing on Teaching and Learning," American Psychologist, vol. 39, No. 3, March 1984.

Box 6-G: Setting and Maintaining Standards

Few tests in this country have attempted to provide [...]
of performance. Most judgments about how well a child or school is doing have been made through the use of
norms, essentially a standard based on average performance. The current effort by the National Assessment of
Educational Progress to establish national proficiency levels of performance (basic, proficient, and advanced) in
mathematics is one such attempt.^1

Consider two different methods that could be used by a teacher to grade the tests of his students. He could
decide to grade them all relative to one another; in this method he looks over all the answers that have been provided
and assigns the highest grade to those students with the highest scores and the lowest grade to the lowest scores.
This is a norm-referenced scoring system. Several problems arise with this system. First, there is no objective
referent: all of his students' answers may still be better than the best answer given in the class next door. Second,
all of his students may have mastered the material of interest; if all have mastered it, the actual differences that
underlie a high and a low score mean very little, and will reflect very fine-grained and perhaps unimportant
distinctions in their understanding. Thus, the drawback of this procedure is that a student's performance is evaluated
solely with respect to the performance of others.

The second method would be to judge the work against some standard reflecting what his students should be
able to do. The teacher determines what an excellent, an average, and a poor answer would look like. All students
are then judged relative to that standard. This is how many teachers assign letter grades. The most widely cited
problem with a standard-based scoring system is that it is hard to equate standards across teachers. Different teachers
hold different expectations for what their students should be able to do and what excellence looks like. However,
reference to some absolute standard of proficiency is in many ways the most meaningful kind of score, particularly
if one wants to compare progress across time or to raise the absolute level of achievement among students.
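
The contrast between the two grading methods can be made concrete with a
small sketch. Everything in it is a hypothetical choice made for
illustration and does not come from the report: the student names and
scores, the rank bands (top third, middle third, remainder), and the fixed
cutoffs of 90 and 75.

```python
def norm_referenced_grades(scores: dict[str, float]) -> dict[str, str]:
    """Grade by rank within the class: top third A, middle third B, rest C."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    grades = {}
    for position, name in enumerate(ranked):
        if position < n / 3:
            grades[name] = "A"
        elif position < 2 * n / 3:
            grades[name] = "B"
        else:
            grades[name] = "C"
    return grades


def standards_based_grades(scores, cutoffs=((90, "A"), (75, "B"))):
    """Grade against fixed cutoffs, regardless of how classmates scored."""
    grades = {}
    for name, score in scores.items():
        # First cutoff the score reaches wins; otherwise the lowest grade.
        grades[name] = next((g for cut, g in cutoffs if score >= cut), "C")
    return grades


# A hypothetical class in which every student has mastered the material.
classroom = {"Ana": 98, "Ben": 96, "Cal": 95, "Dee": 94, "Eli": 93, "Fay": 92}
print(norm_referenced_grades(classroom))
print(standards_based_grades(classroom))
```

In this invented class the rank-based method still hands out low grades on
the strength of very small score differences, while the fixed cutoffs give
every student the same grade, which mirrors the two drawbacks and the
advantage described above.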

Some educational examinations, particularly in European countries, have attempted to set central standards and
have used various mechanisms to maintain the consistency of the standards. In Great Britain, for example, the new
national assessment involves a system of moderation of teacher judgments; initially, teachers are trained to make
judgments about student performance on a number of standardized tasks. During the administration of these tasks
at the end of the year, a moderator is sent to the schools to observe teachers, rate a subsample of students with the
teacher, discuss discrepancies in judgments, and in various other ways maintain the consistency with which the
standards are being applied by teachers in the school.^2



^1 See ch. 3 for a further discussion of standard setting by the National Assessment of Educational Progress (NAEP).
^2 Clare Burstall, National Foundation for Educational Research, London, personal communication, February 1991. See also Department
of Education and Science and the Welsh Office, National Curriculum Task Group on Assessment and Testing: A Report (London, England: [...]



different traits, skills, or abilities than multiple- 
choice items. As one reviewer states: 

The enduring question for the [multiple] choice type items is whether or
not these seemingly artificial contrivances measure the same thing as the
more "natural and direct" free-response types of item. Popular opinion on
this question is rather well formulated and almost universally negative,
i.e., the two types of items do not measure the same thing. One can hear
multiple-choice and true-false questions castigated in nearly any teachers'
lounge in the country on a daily basis, and they are lampooned with regular
frequency in cartoon strips. . . . But at the root, the question of whether
free-response and choice-type tests are measuring the same thing (trait,
ability, level of knowledge) is an empirical one, not a philosophical or
polemical one.^58

Few data are available comparing the extent to
which tests in different formats provide the same
information or different information. Results of a
few studies that shed light on this topic are somewhat
mixed. In some areas, the research evidence
suggests that multiple-choice and open-ended items
measure essentially the same skills.^59 However,
other research suggests that the extent to which
open-ended or multiple-choice tests get at different


^58 Thomas P. Hogan, University of Wisconsin, Green Bay, "Relationship Between Free-Response and Choice-Type Tests of Achievement: A Review
of the Literature," paper prepared for the National Assessment of Educational Progress, 1981.
^59 Ibid.; and Millman and Greene, op. cit., footnote 24.

Similarly, the International Baccalaureate Program has been developed to confer a degree on high school
students worldwide. This program [...]
United States. In order to maintain the [...]
program has a very detailed set of
criteria for grading and judging [...]
approved by the central [...]
several student examinations, one [...]
the central administrative program where standards for [...]
teacher if his grading standards are not in line with the central program standards.^3

Recent developments in psychometric theory and its application to large-scale achievement testing also
provide some encouraging evidence of the possibility of a [...]
to a common scale. Group-level item-response theory may [...]
items could be created for different States or districts [...]
would include a sufficient number of these items so that the rest of their test could be calibrated to national norms
or standards.^4 Such a model still requires, however, some degree of consensus about the content and curricular areas
to be tested.

"Trustworthy comparative data . . . demands a [...]
be a threat to local control. It is one thing to agree that arithmetic [...]
include applications of concepts such as ratios and percents [...]
the assessment of specific skills such as these should take place or on [...]

For subjects such as literature (what books should students read and at what age?) or social studies, these issues
become even more thorny.



^3 Carol M. Dahlberg, coordinator, International Baccalaureate Program, Montgomery High School, Rockville, MD, remarks at OTA
Workshop on Examination Systems in Other Countries and Lessons for the U.S., Mar. 27-28, 1991.

^4 Robert L. Linn, "Accountability: The Comparison of Educational Systems and the Quality of Results," Educational Policy, vol.
1, No. 2, June 1987, pp. 181-198; and R. Darrell Bock and Robert J. Mislevy, "Comprehensive Educational Assessment for the States: The
Duplex Design," CRESST Evaluation Comment, November 1987.

^5 Linn, op. cit., footnote 4, p. 196.



skills will depend on the subject matter being tested.
Evidence is strong, for example, that essay tests of
writing provide different information than do
multiple-choice tests of writing.^60 In part, the potential
usefulness of open-ended items will depend on the
purpose of the particular test and the kind of
information needed.

Multiple Choice: A Renewable Technology? 

Because of concerns related to efficiency, reliabil- 
ity, and economy, many researchers and test devel- 
opers think that the multiple-choice test will proba- 
bly always have some role to play in the assessment 
of achievement. Therefore, educators and psychom- 
etricians have become interested in exploring ways 



to improve the multiple-choice items that currently
dominate standardized achievement tests. A number 
of State assessment programs have put efforts into 
developing multiple-choice items that seem to 
require more complex thinking skills and are more 
consistent with their changing educational goals. 

For example, Michigan recently decided to move away from an exclusively
skill-based approach to reading. New statewide reading objectives were
developed consonant with a redefinition of reading as a process that
involves constructing meaning through a dynamic interaction between the
reader, the text, and the context of the reading situation. A new approach
to assessing these goals was also needed, so the State embarked on
developing new tests to be used with grades 4, 7, and 10. Michigan's
innovative reading assessment program involves many changes in the tests,
including the use of stories drawn from children's literature and other
primary sources instead of short excerpted passages or ones written for the
test, while still employing a multiple-choice format for answering the
questions. Such questions are designed to assess "constructing meaning" and
"knowledge about reading" as well as factors typically not tested such as a
child's familiarity with the topic of the story and his or her effort and
interest in the testing questions.^61

^60 Traub, "On the Equivalence of the Traits Assessed by Multiple-Choice and Constructed-Response Tests," Construction Versus Choice in
Cognitive Measurement, R.E. Bennett and W.C. Ward (eds.) (Hillsdale, NJ: L. Erlbaum Associates, in press); and Edys S. Quellmalz, "Designing
Writing Assessments: Balancing Fairness, Utility and Cost," Educational Evaluation and Policy Analysis, vol. 6, No. 1, spring 1984, pp. 63-72. It should
also be noted that much of the research that does exist about item differences has been based on college or college-bound students and ". . . hence those
of (a) above average ability, (b) beyond the years of rapid cognitive development, and (c) from predominantly middle-class, White, Western cultural
background." Hogan, op. cit., footnote 58, p. 46. Some of the field studies conducted as part of the National Assessment of Educational Progress can
[...] will provide much needed data about the performance of a diverse population of elementary and secondary students.

A point that is consistently made by those who design educational tests is
that multiple-choice items are not restricted to assessing only basic
skills or the memorization of facts.^62 Multiple-choice items, if carefully
crafted, can be used to assess very high levels of expertise, for example
in admissions tests for graduate education (Law School Admission Test,
Graduate Record Exam) and board certification examinations for physicians.
The ACT Science Reasoning Test, which is part of the ACT used for college
admissions, uses multiple-choice items to assess interpretation, analysis,
evaluation, reasoning, and problem-solving skills required in the natural
sciences. Each unit on the test presents scientific information (in the
form of graphs, results of experiments, or descriptions of conflicting
scientific theories) that the student must interpret. According to the test
designers, advanced knowledge in the subjects covered by the test (biology,
chemistry, physics, and the physical sciences) is not required; instead the
test emphasizes scientific reasoning skills.^63 The National Assessment of
Educational Progress (NAEP) has also put considerable effort into
developing multiple-choice items to measure thinking skills such as solving
problems and conducting inquiries in science, conceptual understanding and
problem-solving in mathematics, and evaluating information and constructing
meaning in reading. See figure 6-4 for examples of items drawn from these
and other multiple-choice tests designed to assess more complex thinking
skills.

Recent research and development efforts have 
suggested additional ways that multiple-choice tests 
might be designed to reflect complex processes of 
learning and development: 

• One effort to assess science understanding has
focused on trying to describe the various
"mental models" that children hold before
they master the correct understanding of basic
scientific principles. Multiple-choice items,
such as the one in figure 6-5, are then designed
to represent these various mental models; each
distractor (or incorrect choice) represents a
commonly held misconception about a scientific
principle. Wrong answers can be examined
by the teacher to discern what misconceptions
each child may hold and better focus
instruction.^64

• Similarly, if free-response answers given by
children to all kinds of open-ended tasks can be
analyzed, then the kinds of misunderstandings
and errors commonly made by children can be
described. This information can be used to
write distractors that reflect these errors (not
just to "trick" students) and may then be useful
in diagnosing mistakes and error patterns.

• Researchers for some time have explored ways 
of giving partial credit for partial understanding 
on multiple-choice questions. One method of 
doing this involves giving different weights or 
points to different answers that are written to
reflect incorrect, partial, and complete under- 
standing of the solution. Partial credit scoring 
procedures are particularly relevant for diag- 



[...] Millman and Greene, op. cit., footnote [...]

[...] Manual for Teachers and Counselors (Iowa City, IA: [...]

Figure 6-4: Multiple-Choice Items Designed To Measure Complex Thinking Skills



Thinking Skill:^a Knowing Science

Grade Levels: 4, 8, 12

Response columns: Always True / Sometimes True / Never True

Scientists should report exactly what they observe .... *
Belief is the main basis for scientific knowledge .... *
Knowledge is the goal of scientific work .... *
Scientific knowledge can be questioned and changed .... *
Knowledge discovered in the past is used in current scientific work .... *
Scientists who do experiments find answers to their questions .... *

Grade Level: 4

The methods of science can be used to answer all of the
following questions EXCEPT:

*(A) Are puppies more beautiful than spiders?

(B) How many oak trees grow in Pennsylvania?

(C) Which laundry detergent cleans best?

(D) What are the effects of lead pollution on trout?



Thinking Skill:^b Applying Principles

Grade 8

If the law of supply and demand works, the farmer
will obtain the highest price for crops when

A. both supply and demand are great.

B. both supply and demand are low.

C. supply is great and demand is low.

*D. supply is low and demand is great.



Thinking Skill:^c Summarizing Ideas

Read the sentence. Then choose the essential phrase that should
be included in research notes for a paper on the subject.

Despite the fact that Puritan forces in England objected to plays and
tried to interfere with performances, theatrical entertainment enjoyed
great popularity in Shakespeare's time, both with the public and with
the members of the royal court.

A royal court enjoyed plays during Shakespeare's time
*B plays popular despite objection and interference by Puritans
C theatrical entertainment very popular with the public
D Puritans object to public performances

Thinking Skill:^c Comprehension

Read the question and then choose the best answer.

Which of these is most like an excerpt from a myth?

*A And so the turbulent sea suddenly grew calm as Father
Neptune urged his steeds forward and flew off toward the
setting sun.

B Gold coins were reported to have come from an ancient
Phoenician ship that sank off the island during Homeric times.

C We lowered the sails but the Moon Goddess still lurched
violently on the crashing waves as we prepared to ride out the
storm.

D Retrace the voyage of Ulysses in a 21-day adventure that takes
you from Asia Minor to the islands and mainland of Greece.



* Correct answers for multiple-choice items are indicated by an
asterisk (*).

^a SOURCE: National Assessment of Educational Progress, Science Objectives: 1990 Assessment, booklet No. 21-S-10 (Princeton, NJ: 1989), pp. 45-46.

^b SOURCE: Connecticut State Department of Education, Connecticut Assessment of Educational Progress 1982-83: Social Studies Summary and
Interpretations Report (Hartford, CT: 1984).

^c SOURCE: CTB/McGraw-Hill, Comprehensive Tests of Basic Skills (CTBS) Class Management Guide: Using Test Results (Monterey, CA: 1990), pp. 6b, 70.
These are sample items that do not appear on an actual test.

Figure 6-5: Sample Multiple-Choice Item With Alternative Answers Representing
Common Student Misconceptions



A spaceship is drifting sideways in space from point A to point B; it is not affected by outside
forces. At point B, its engine fires to produce a constant thrust at a right angle to AB. At point
C, the engine is shut off again.

[Diagram not reproduced: five candidate paths, numbered 1 through 5.]

Which of the following (1, 2, 3, 4, or 5) best represents the path of the spaceship?

The correct answer is [...]

NOTE: The alternatives presented represent both the correct mental model of the effect of forces on a spaceship and
a variety of possible answers based on known, erroneous mental models that children hold.

SOURCE: R.J. Shavelson, N.B. Carey, and N.M. Webb, "Indicators of Science Achievement: Options for a Powerful
Policy Instrument," Phi Delta Kappan, vol. 71, No. 9, May 1990, p. 697.



nostic tests designed to describe a student's 
strengths and weaknesses.^65

• The complex multiple-choice item is a widely
used format in medical and health professions'
testing programs where many questions have
more than one right answer. In this item type,
four or five answers are presented and the
student can select any number of correct
responses from none to all.^66

• Another way that multiple-choice items can be
used to measure more complex understandings
is to group a series of them together based on a
common set of data. The data may be in the
form of charts, graphs, results of experiments,
maps, or written materials. Students can be
asked ". . . to identify relationships in data, to
recognize valid conclusions, to appraise
assumptions and inferences, to detect proper
applications of data, and the like."^67
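
Two of the ideas in the list above, distractors written to reflect specific
misconceptions and partial credit for partial understanding, can be sketched
together. This is a hypothetical illustration rather than any scoring scheme
described in the report: the item, the option letters, the credit weights,
and the misconception labels M1 to M3 are all invented.

```python
from collections import Counter

# Hypothetical item: each option carries a credit weight and, for
# distractors, a label for the misconception it was written to reflect.
ITEM = {
    "A": {"credit": 0.0, "misconception": "M1"},
    "B": {"credit": 0.5, "misconception": "M2"},  # partial understanding
    "C": {"credit": 1.0, "misconception": None},  # keyed (correct) answer
    "D": {"credit": 0.0, "misconception": "M3"},
}


def score_response(choice: str) -> float:
    """Partial-credit score for one student's chosen option."""
    return ITEM[choice]["credit"]


def misconception_report(responses: dict[str, str]) -> Counter:
    """Count how many students chose each misconception-bearing distractor."""
    return Counter(
        ITEM[choice]["misconception"]
        for choice in responses.values()
        if ITEM[choice]["misconception"] is not None
    )


# Hypothetical class responses (student -> chosen option).
responses = {"s1": "A", "s2": "A", "s3": "C", "s4": "B", "s5": "D"}
total = sum(score_response(choice) for choice in responses.values())
print("class total:", total)
print("misconceptions:", misconception_report(responses))
```

A teacher reading the second line of output would see which wrong ideas are
most common in the class, which is the diagnostic use the list describes;
the weights themselves would have to be justified by whoever writes the
item.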

Redesigning Tests: Function Before Form 

Test use in schools has been increasing. Much of
the increase in the volume of school-based testing in 
the last decade has come from its rising popularity as 



^65 Millman and Greene, op. cit., footnote 24; Thomas M. Haladyna, "The Effectiveness of Several Multiple-Choice Formats," Applied Measurement
in Education, in press. For a discussion of ways in which test theory will have to develop and change in order to accommodate the measurement of
problem-solving strategies and misconceptions see Robert J. Mislevy, Foundations of a New Test Theory, ETS Research Report RR 89-52-ONR
(Princeton, NJ: Educational Testing Service, October 1989).

^66 Haladyna, op. cit., footnote 65. This item type has been found to have a number of technical problems. Haladyna recommends the related five-option
"multiple true-false" item.

^67 Gronlund and Linn, op. cit., footnote 16, p. 193.

Table 6-3: Functions of Tests: What Designs Are Needed?

                                      Classroom                System monitoring         Selection, placement,
                                      instructional guidance                             and certification

Who needs to be described             Individuals              Groups of students        Individuals
"Stakes" or consequences attached     Low                      High or low               High

Characteristics of the test needed
  Comparability of information        Low                      High                      High
  Impartial scoring (not teachers)    No                       Yes                       Yes
  Standardized administration         No                       Yes                       Yes

Type of information needed
  Detailed v. general                 Detailed                 General                   General
  Frequency                           Frequently during a      Once a year or less       Once a year or less
                                      single school year
  Results needed quickly              Yes                      No                        No

Technical requirements
  Need for high test reliability      Can vary                 Depends on size of group  Very high
    (internal consistency and
    stability)
  Type of validity evidence           Content                  If low stakes: content;   Content; additional validity
                                                               if high stakes: content   evidence must be demonstrated
                                                               and construct             for the specific purpose
                                                                                         (e.g., certification =
                                                                                         criterion validity,
                                                                                         selection = predictive
                                                                                         validity)

SOURCE: Office of Technology Assessment, 1992; adapted from Lauren B. Resnick and Daniel P. Resnick, "Assessing the Thinking Curriculum: New Tools
for Educational Reform," paper prepared for the National Commission on Testing and Public Policy, August 1989. (To appear in B.R. Gifford and
M.C. O'Connor (eds.), Future Assessments: Changing Views of Aptitude, Achievement, and Instruction (Boston, MA: Kluwer Academic Publishers,
in press).)



an accountability tool for policymakers interested in 
a measure of system effectiveness (see ch. 2). The 
available testing technology — norm-referenced mul- 
tiple-choice tests — has been pressed into service 
even when the properties of this technology were not
well matched to the needs of the users. Similarly,
there has been increasing interest in the role that tests
can play in fostering learning and knowledge
acquisition in the classroom. For tests to have
educational value to the student in the classroom,
educators argue, the tests must be frequent, provide
feedback in a timely fashion, and make clear the
expectations and standards for learning. A single
testing technology no longer seems enough for the 
needs of multiple users. How, then, should we 
redesign achievement tests to better serve multiple 
testing needs? 

Table 6-3 summarizes the characteristics of tests 
required for each of the three main functions of 
testing. Consider first the system monitoring function
of tests. In this case only groups of students need
to be described, that is classrooms, schools, districts,
or States. Individual scores are not needed. This
means that sampling methodologies can be used: a
representative subset of students can be tested and
accurate information obtained. One of the advantages
of a sampling methodology is that no individual
scores are available, thus preventing their use for
unintended purposes such as selecting students for
special programs or grouping students according to 
ability. One of the drawbacks sometimes cited for 
sampling, however, is that students may not be 
particularly motivated to give their best performance 
when they are not going to receive personal scores 
(see ch. 3). 
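
The sampling idea can be made concrete with a small illustration. The following Python sketch is not drawn from any OTA or NAEP procedure; it assumes simple random sampling from a hypothetical list of student scores and reports only a group mean and its approximate standard error, so that no individual score leaves the function. The function name `estimate_group_mean` and the example data are assumptions introduced here for illustration.

```python
import math
import random
import statistics


def estimate_group_mean(scores, sample_size, seed=0):
    """Estimate a group's mean score from a simple random sample.

    Returns (sample_mean, standard_error). Only group-level summaries
    are returned; individual scores are never reported.
    """
    if not 2 <= sample_size <= len(scores):
        raise ValueError("sample_size must be between 2 and len(scores)")
    rng = random.Random(seed)                  # fixed seed for repeatability
    sample = rng.sample(scores, sample_size)   # sampling without replacement
    sample_mean = statistics.mean(sample)
    # Approximate standard error of the mean (no finite-population correction).
    standard_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return sample_mean, standard_error


if __name__ == "__main__":
    # Hypothetical district of 1,000 students with made-up scores.
    district_scores = [50 + (i * 7) % 41 for i in range(1000)]
    mean, se = estimate_group_mean(district_scores, sample_size=100)
    print(f"estimated district mean: {mean:.1f} (standard error {se:.2f})")
```

Operational programs typically use more elaborate designs (for example, stratified or matrix sampling), which this sketch does not attempt to reproduce.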

In system monitoring, managerial uses can in- 
clude information that has both high and low stakes. 
Purely informational uses (without consequences) 
may include program evaluation and curricular 
evaluation. Similarly, some administrators may 
want information about how their system is doing 
but may not attach any particular rewards, sanctions, 
or expectations to the test scores; test results would
have a "temperature taking" function. NAEP is an
example of a test designed to provide nationally
representative information of this type. However,
increasingly tests are being used for accountability
purposes: rewards and consequences are attached
to the results of those tests and they are being used
as a lever to motivate improvement. When this 
happens, the informational value of the test can be 
compromised. Attention is readily focused on test
performance as a goal of instruction; in this case
improvement in test scores may or may not signal 
growth in real achievement.^ 

Many of the characteristics of tests designed for 
monitoring systems are those expected from stand- 
ardized achievement tests. It is very important that 
the results obtained from these tests be comparable 
across students and that they can be aggregated in a 
meaningful way. This means that the tests must be 
standardized in administration and scoring. Impar- 
tial scoring is very important. The monitoring of 
systems requires general information at occasional 
intervals (usually once a year or less). The results are 
not needed immediately. 

Tests used for selection, placement, or certification
differ from tests used for system monitoring in
several major ways. First, each student must receive
a score. Second, the kinds of decisions these tests are
used to make are almost always high stakes: they
can have significant consequences for an individual's
educational career. Tests used for selection,
placement, and certification must meet exceptionally
high standards of comparability, reliability, and
validity. As with tests used for monitoring systems,
impartial scoring and standardized administration 
are required; similarly the information required is 
general, needed infrequently (once a year or less) 
and not required quickly. 

The third major difference is in the kind of 
validity evidence required. Tests for selection,
placement, or certification must be validated for 
each of those specific uses. Thus certification tests 
need criterion-related validity evidence particularly 
related to the "cutoff scores" that are established to
certify mastery. Selection tests need predictive 
validity evidence demonstrating that test results 
relate to future performance or ability to benefit from 
a particular resource or intervention. In the current 
debate about redesigning tests, there is little discus- 
sion by educators or measurement specialists about 
needing or using various new test designs for 
selection. In part, this may be due to a fairly 
widespread and entrenched belief that selection tests 
are not appropriate for elementary school and, for the 
most part, not within secondary school either.^^ 
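
One common way to summarize predictive validity evidence is a correlation between test scores and a later outcome measure. The sketch below assumes that approach (the text above does not prescribe a particular statistic) and uses made-up numbers; `pearson_r` is a helper name introduced here for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical data: selection test scores and a later performance measure.
    selection_scores = [52, 61, 47, 70, 66, 58]
    later_performance = [2.4, 3.1, 2.2, 3.6, 3.0, 2.9]
    r = pearson_r(selection_scores, later_performance)
    print(f"predictive validity coefficient (hypothetical data): {r:.2f}")
```

A coefficient of this kind is only one piece of evidence; a certification test would additionally need evidence about the chosen cutoff score, which a single correlation does not provide.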



Tests designed for classroom use are the most
divergent in their design requirements (see table
6-3), differing significantly both from existing and
new tests designed to serve managerial functions.
Tests used by teachers to monitor learning and
provide feedback need to provide detailed information
on a frequent basis, as quickly as possible.
Because classroom tests are very closely related to
the goals of instruction, time spent on testing need
not be considered "wasted time." As testing at the
classroom level becomes more integrated with
instruction, the time constraints so often imposed on
tests can be relaxed considerably because time spent
on tests is also time spent learning. Because these
tests do not carry high stakes and because they are
not going to be used to make comparisons among
students or schools, they are free of many of the
stringent requirements of standardization, impartial
scoring, and need for comparability. However, the
more that teachers or school systems want these
classroom level tests to be useful for other purposes,
i.e., to make high-stakes decisions about individuals
or to aggregate the information across classrooms or
schools, the more that these classroom tests will
need to incorporate features that provide comparability
and standardization. It is difficult to prevent
the misuse of information once that information
has been collected. One of the dangers, therefore,
in relaxing technical standards for classroom
tests is that the use of the scores cannot be
restricted or monitored appropriately once they
are obtained.

How can the various functions of testing and 
design requirements be coordinated with one an- 
other? Most investigators working in test design 
today believe that one test cannot successfully serve 
all testing functions. 

Many of the features of tests that can effectively 
influence classroom learning are very different from 
the requirements of large-scale managerial testing.
Many testing experts believe that we need two 
distinct types of tests to serve these two functions 



^For a discussion of the "Lake Wobegon Effect" and other evidence about how gains in test scores can be attained without affecting "real
achievement," see ch. 2.

^^Haney, op. cit., footnote 35.
because the requirements are so divergent.^ The
Pittsburgh school district, for example, has developed
a diagnostic testing system, called Monitoring
Achievement in Pittsburgh (MAP), which is characterized
by tests closely aligned with curricula, brief
and frequent administration of those tests, and rapid
turnaround of results. These test results are then used
to inform instruction, as teachers can see whether an
objective that has been covered has, in fact, been
learned by the class and tailor instruction accordingly.
Pittsburgh uses a different test for system
monitoring; analyses have suggested that recent 
gains on this traditional norm-referenced test are 
largely due to the effects of MAP.''^ 

Conclusions 

No testing program operates in a void. The effects 
of any testing program on the school system as a 
whole, or of different tests on one another, need to
be continually monitored. Other testing
requirements, imposed by the State or a special
program such as Chapter 1, may also affect the
impact of a new test or new reform program. The



consequences of a given test—to the individual 
student, the teacher, the school— will heavily influ- 
ence the effects of that test on learning and 
instruction. A beautifully designed and education- 
ally relevant test may have no impact if no one looks 
at its scores; the poorest quality test available could 
conceivably influence much of a school's educa- 
tional climate if the stakes attached to it are high. 

What a test looks like — the kinds of tasks and 
questions it includes — should depend on the in- 
tended purpose of the test. As the next chapter will 
illustrate, test formats can vary widely from multiple- 
choice to essays to portfolios. Different types of 
testing tasks will be more or less useful depending
on the purpose of the test and the type of information 
needed. The purpose of a test and a definition of 
what it is intended to assess need to be carefully 
determined before test formats are chosen. More- 
over, critical issues such as bias, reliability, and 
validity will not be resolved by changing the format 
of the test. 



^Paul G. LeMahieu and Richard C. Wallace, Jr., "Up Against the Wall: Psychometrics Meets Praxis," Educational Measurement: Issues and
Practice, vol. 5, No. 1, spring 1986, pp. 12-16; and Educational Testing Service, "Instructional and Accountability Testing in American Education:
Different Purposes, Different Needs," brochure, 1990.

''^LeMahieu and Wallace, op. cit., footnote 70; and Paul G. LeMahieu, "The [...] Monitoring Through Frequent Testing,"
Educational Evaluation and Policy Analysis, vol. 6, No. 2, summer 1984, pp. 175-187.
Highlights

• Many school districts and States are turning to performance assessment, testing that requires students
to create answers or products that demonstrate what they know and can do, as a complement to their
traditional testing programs. Thirty-six States now use direct writing samples, and 21 States use other
types of performance assessment (in addition to writing samples) on a mandatory, voluntary, or
experimental basis.

• Writing samples and constructed-response items, which require test takers to produce an answer rather
than select from a number of options, are the most common forms of performance assessment; other
methods, such as portfolios of student work, exhibitions and simulations, science experiments, and oral
interviews, are still in their infancy.

• Although performance assessment methods vary, they share certain key features. They involve direct
observation of student behavior on tasks resembling those considered necessary in the real world, and
they shed light on students' learning and thinking processes in addition to the correctness of their
answers.

• Performance assessment methods must meet the challenge of producing reliable and valid estimates
of student achievement before they can be used for high-stakes decisions involved in system
monitoring or selection, placement, and certification. Procedures to reduce subjectivity and eliminate
error in human scoring have been developed and used with some success in scoring essays and student
writing samples.

• Researchers are developing methods for machine scoring of constructed-response items. Test taking
by computer is one approach. Others include having students fill in grids to answer mathematics
problems or draw responses on a graph or diagram.

• Advanced information technologies could significantly enhance performance assessment methods:
tracking student progress, standardizing scoring, presenting simulations and problems, video recording
performance for later analysis, and training teachers are among the most promising possibilities.

• Performance assessment is usually more expensive in dollar outlays than conventional multiple-choice
testing because it requires more time and labor to administer and score. However, these high costs
might be balanced by the added instructional benefits of teacher participation in developing and
scoring tests, and by the closer integration of testing and instruction in the classroom.

• For performance assessment to become a meaningful complement or substitute for conventional
testing, educating teachers and the general public will be critical. Teachers need to learn how to use,
score, and interpret performance assessments. The public, accustomed to data ranking students on
norm-referenced, multiple-choice tests, needs to understand the goals and products of performance
assessment.

• Changing the format of tests will not by itself ensure that the tests better meet educational goals.
However, since what is tested often drives what is taught, testing should be designed to reflect
outcomes that are desired as a result of schooling.



Introduction 

Springdale High School, Springdale, Arkansas. 
Spring 1990. Instead of end-of-year examinations,
seniors receive the following assignment for a
required "Final Performance Across the Disciplines":

Discuss behavior patterns as reflected in the insect
world, in animals, in human beings, and in literature.
Be sure to include references to your course work
over the term in Inquiry and Expression, Literature
and the Arts, Social Studies, and Science. This may
draw upon works we have studied, including Macbeth,
Stephen Crane's poetry, Swift's "A Modest Proposal"
and other essays, Mark Twain's fiction,
materials from the drug prevention and communication
workshop, or behaviors you have observed in
school. You may also add references to what you
have read about in the news recently. On day 1 of the
examination you will be given 4 periods in which to
brainstorm, make an outline, write a rough draft, and
write a final copy in standard composition form. You
will be graded not only on how well you assimilate
the material but also how well you reflect our
"student as worker" metaphor and how responsibly
you act during the testing period. On day 2 of the
examination, you will assemble in villages of three,
evaluate anonymous papers according to a set of
criteria, and come to a consensus about a grade. Each
paper will be evaluated by at least two groups and
two instructors. Part of your overall semester grade
will reflect how responsibly you act as a member of
a team in this task.^

Constable Elementary School, South Brunswick,
New Jersey. Fall 1990.^ Every morning, between
10:30 and 11:50, first grade teacher Sharon Suskin
settles her class down to a quiet activity supervised
by an aide while she calls one student at a time up to
her table. With Manuel she says: "I'm going to read
you this story but I want you to help me. Where do
I start to read?" As the shy 6-year-old holds the book
right side up and points to the print on the first page,
she smiles and continues: "Show me where to
start." She puts a check on her list if he begins at the
top left, another if he moves his finger from left to
right, another for going page by page. When it is
Joanna's turn, she asks her to spell some words:
"truck," "dress," "feet." Mrs. Suskin makes a
note that, while last month Joanna was stringing
together random letters, she now has moved into a
more advanced phonetic spelling ("t-r-k," "j-r-s,"
and "f-e-t") representing the sounds in a word.
Mrs. Suskin spends anywhere from 2 to 10 minutes
with each child, covering about one-half the class
each morning, and files the results in each child's
portfolio later in the day. When parents come in for
conferences, out comes the portfolio. Mrs. Suskin
shows Manuel's parents how far he has come in
reading skills; Joanna's parents see records of
progress rather than grades or test scores. Mrs.
Suskin refers to the portfolio regularly, when grouping
students having similar difficulties, or when she
wishes to check on special areas where an individual
child needs help. It's a lot of work, she admits, but
she says it gives her a picture of each child's
emerging literacy. She laughs: "It makes me put on
paper all those things I used to keep in my head."

All Over California, Spring 1990.^ All 1.1 million
fifth, seventh, and ninth grade students in California
were huffing and puffing, running and reaching.
They were being tested in five measures of fitness:
muscular strength (pull ups); muscular endurance
(sit ups); cardiovascular fitness (a mile run); flexibility
(sit and reach); and body fat composition (skin
fold measurements). Results were tabulated by age
and sex, along with self-reported data of other
behavior, such as the amount of time spent watching
television or engaging in physical activity. The tasks
and standards were known in advance, and local
physical education teachers had been trained to
conduct the scoring themselves. The results were
distressing: only 20 percent of the students could
complete four or five tasks at the "acceptable"
level. The bad news sent a signal to the physical
education programs all over the State. Teaching to
this test is encouraged as schools work to get better
results on the next test administration. The overall
goal is more ambitious: to focus awareness on the
need for increasing attention to physical fitness for
all students, and to change their fitness level for the
better.

Why Performance Assessment?

These vignettes are examples of performance
assessment, a broad set of testing methods being
developed and applied in schools, districts, and
sometimes statewide. This concept is based on the
premise that testing should be more closely related
to the kinds of tasks and skills children are striving
to learn. Emotionally charged terms have been
applied to this vision of testing. "Authentic,"
"appropriate," "direct," and even "intelligent"
assessment imply something pejorative about
multiple-choice tests. This rhetoric tends to ignore that certain
multiple-choice tests can provide valuable information
about student achievement. OTA uses the more



^Brown University, The Coalition of Essential Schools, Horace, vol. 1, No. 6, March 1990, p. 4.

^From Ruth Mitchell and Amy Stempel, Council for Basic Education, "Six Case Studies of Performance Assessment," OTA contractor report, March
1991.

^Dale Carlson, "What's New in Large-Scale Performance Testing," paper presented at the Boulder Conference of State Testing Directors, Boulder,
CO, June 10-12, 1990.






Chapter 7— Performance Assessment: Metliods and Characteristics • 203 



neutral and descriptive term "performance assessment"
to refer to testing that requires a student
to create an answer or a product that demonstrates
his or her knowledge or skills.

The act of creating an answer or a product on a test 
can take many forms. Performance assessment 
covers a range of methods on a continuum, from 
short-answer questions to open-ended questions 
requiring students to write essays or otherwise
demonstrate understanding of multiple facts and 
issues. Performance assessment could involve an 
experiment demonstrating understanding of scien- 
tific principles and procedures, or the creation and 
defense of a position in oral argument or comprehen- 
sive performance. Or it may mean assembling a 
portfolio of materials over a course of study, to 
illustrate the development and growth of a student in 
a particular domain or skill (see ch. 1, box 1-D). 

Whatever the specific tasks involved, this move 
toward testing based on direct observation of per- 
formance has been described by some educators as 
"nothing short of a revolution" in assessment.^
Given that performance assessment has been 
used in businesses and military training for many 
years, and by teachers in their classrooms as one 
mechanism to assess student progress, the real 
revolution is in using performance assessment as 
a part of large-scale testing programs in ele- 
mentary and secondary schools. 

The move toward alternative forms of testing 
students has been motivated by new understandings 
of how children learn as well as changing views of 
curriculum. Recent research suggests that complex 
thinking and learning involves processes that cannot 
be reduced to a routine,^ that knowledge is a 
complex network of information and abilities rather 
than a series of isolated facts and skills. According 
to this research, students need to be able to
successfully engage in tasks that have multiple
solutions and require interpretive and nuanced
judgments. This kind of performance in real-world
settings is inextricably supported and enriched by
other people and by knowledge-extending artifacts
like computers, calculators, and texts.^

[Photo caption: Performance assessment often involves direct observation
of students engaged in classroom tasks. For example,
examinations that require students to plan, conduct, and
describe experiments reinforce instruction that
emphasizes scientific understanding through
hands-on activities.]

This view of learning challenges traditional views 
of how to structure curricula and teach, and therefore 
also how to evaluate students' competence. If 
knowledge is linked in complex ways to situations 
in which it is used, then testing should assign 
students tasks that require interpretation and application
of knowledge. If instruction is increasingly
individualized, adaptive, and interactive, assess- 
ment should share these characteristics. However, 
educators trying to implement curricular innova- 
tions based on this more complex view of learning 
outcomes have found their new programs judged by 
traditional tests that do not cover the skills and goals 
central to their innovations. Many say that school
reform without testing reform is impossible. For
example, the National Council of Teachers of
English recently warned that: ". . . school restructuring
may be doomed unless it helps schools move
beyond the limitations of standardized tests."^



^Jack Foster, secretary for Education and Humanities, State of Kentucky, personal communication, Mar. 11, 1991.

^See also ch. 2; and Center for Children and Technology, Bank Street College, "Applications in Educational Assessment: Future Technologies,"
OTA contractor report, February 1990.

^Additional interest in increased teaching of more complex thinking skills comes not only because of disappointing evidence about students' abilities,
but also because of the belief that all workers will require these adaptive capabilities, i.e., the ability to apply knowledge to new situations.

^New York State United Teachers Task Force on Student Assessment, "Multiple Choices: Reforming Student Testing in New York State,"
unpublished report, January 1991, p. 12: citing the 1990 National Council of Teachers of English, Report on Trends and Issues.
Educators advocating performance assessment
are also interested in the possibility of making good
assessment a more integral and effective part of the 
learning process. These advocates hope that stand- 
ardized performance-based testing can become a 
helpful part of classroom learning rather than a 
distraction or a derailment of classroom practices. In 
this view, time spent studying or practicing for tests, 
or even going through the tests themselves, is no 
longer seen as time away from valuable classroom 
learning but rather an integral learning experience.^ 

Indeed, some proponents of performance assessment
suggest that its strongest value lies in how it
can influence curriculum and instruction by modeling
desired educational outcomes. Although "teaching
to the test" is disparaged when a test calls for
selection of isolated facts from a multiple-choice
format, it becomes the modus operandi in performance
assessment. Perhaps the prime reason for the
popularity of performance assessment today stems
from the idea that student learning should be guided
by clear, understandable, and authentic examples
that demonstrate the desired use of knowledge and
skills. Assessment is then defined as the tool to judge
how close the student has come to replicating the
level of expertise modeled in the examples. The
theory is that performance assessment is an effective
method for clarifying standards, building consensus
about goals, and delivering a more cohesive curriculum
throughout a school system.

As States and districts begin to change their
educational goals and curricula, student assessments
are also being revised to meet these changing
standards and goals. Educators have always recognized
that traditional multiple-choice tests do not
capture all the objectives valued in the curricula.
Some testing programs have attempted to overcome
this problem by incorporating some open-ended
tasks. However, the increasing stakes attached to
traditional test scores have given the tested objectives
a great deal of attention and weight in classrooms,
often at the expense of objectives that are valued but
not directly tested. Policymakers have become
interested in tests covering a much wider range of
skills and educational objectives, and in various
forms of performance assessment that can broaden
educational outcomes.

The real policy issue is not a choice between
performance assessment and multiple choice, but
using tests to enrich learning and understand
student progress. Embracing performance assessment
does not imply throwing out multiple-choice
tests; most States are looking to performance
assessment as a means of filling in the gaps.
The skills that are not usually evaluated on
multiple-choice tests (writing, oral skills, ability to organize
material, or perform experiments) have been the
first candidates for performance assessments. New
York's position is illustrative:

Student performance assessments should be developed
as a significant component of the state's
system of assessment. These assessments would
include improved multiple-choice tests and incorporate
authentic "real-life" measures of student knowledge.
Student performance, judged against clearly
defined standards of excellence, would better measure
the skills of critical thinking, reasoning, information
retrieval and problem solving. Such performance
assessments could include portfolios, hands-on
problem-solving projects, and demonstrations of
ability and knowledge.^

State Activities in Performance 
Assessment 

State and local districts have rapidly adopted 
performance assessment for a range of grade levels 
and testing objectives. OTA estimates that, as of 
1991, 36 States were assessing writing using direct 
writing samples (see figure 7-1); in addition, 21
States had implemented other types of performance
assessment on a mandatory, voluntary, or experimental
basis^^ (see figure 7-2). At the present time,
most performance assessments are on a pilot or
voluntary basis at the State level. When mandated
statewide, performance assessments tend to be 
administered in one or two subjects at selected grade 
levels. 
^This issue has important implications for the estimation of costs associated with alternative testing programs. See discussion in ch. 1.
^New York State United Teachers Task Force on Student Assessment, op. cit., footnote 7, p. 4.

^Office of Technology Assessment data, 1991. The category of writing assessments includes just those tests that evaluate student writing skills by
asking them to write at some length (paragraphs or essays); other performance assessments reported by States included portfolios, exhibitions or
activities, and open-ended paper-and-pencil tests that include student-created answers. This last category includes student essays designed to test
knowledge on a particular subject, not testing writing skills per se.
[Figure 7-1: map of statewide writing assessment; title and part of the legend are not recoverable.
Surviving legend categories: optional writing assessments (n=4); future plans to assess writing (n=9);
no current or future plans to assess writing (n=5).]
NOTE: "Future plans" includes current pilot programs.
SOURCE: Office of Technology Assessment, 1992.

Figure 7-2: Statewide Performance Assessments, 1991
[Map; the only surviving legend category is: None (n=14).]
NOTE: Map includes optional programs.
SOURCE: Office of Technology Assessment, 1992.
Seven States (Arizona, California, Connecticut,
Kentucky, Maryland, New York, and Vermont*^) are
moving their educational evaluation systems toward
performance assessment, gradually reducing reliance
on norm-referenced multiple-choice testing.
Each State has approached the change differently, 
but they view performance assessment as a tool not 
only for understanding the progress of individual 
students, but also for school, district, or State 
accountability. These State efforts will exert a 
tremendous influence as comparisons and rankings 
between schools develop, and policy decisions are 
made as a result of these new testing results. 

The variety of approaches in State testing policies 
stands in contrast to the traditional State processes 
for test selection. Historically, State departments of 
education selected tests with little or no input from 
teachers or the public. The testing division would 
invite publishers to bid on the development of a 
norm-referenced or criterion-referenced test based
on the State's curriculum, or, more commonly, shop
around and then purchase "off-the-shelf" tests such
as the Iowa Tests of Basic Skills, Stanford Achievement
Tests, California Achievement Tests, or other
popular norm-referenced achievement tests.*^ This
process is changing. 

The State profiles in boxes 7-A, 7-B, and 7-C
provide a picture of how some States are moving 
toward greater use of performance assessment in 
their statewide testing programs. They illustrate the 
motivation behind these changes, as well as prob- 
lems and barriers States face in implementing these 
changes. 

The Many Faces of Performance 
Assessment: Forms and Functions 

Performance assessment can take many forms.
The central defining element in all performance
assessment methods is that the test taker creates an
answer or product to demonstrate knowledge or
skills in a particular field. From paper-and-pencil,
short-answer questions to essays requiring use of
knowledge in context, oral interviews, experiments,
exhibitions, and comprehensive portfolios with multiple
examples of a student's work over a period of
an entire year or longer, each type has its own
characteristics. Nonetheless, many characteristics
are shared. This section describes some of the
common forms of performance assessment used in
K-12 schools today. It is followed by a section that
summarizes the common characteristics of performance
assessment.

Constructed-Response Items 

Paper-and-pencil tests designed by teachers have 
long been a regular feature of the classroom; 
teachers typically employ a range of item types that 
include mathematics calculations, geometry proofs, 
drawing graphs, fill-in-the-blank, matching, definitions,
short written answers, and essays. Except for
multiple choice and essays, few of these item types 
have been used for large-scale standardized testing 
programs, but test developers and educators have 
begun to consider this possibility. 

The term constructed-response (CR) item is
commonly used to distinguish these items from
items such as multiple choice that require selecting
a response among the several options presented. CR
items require students to produce or construct their
own answers.¹³

Several educational advantages might be gained
by expanding the use of CR items.¹⁴ First, they have
higher face validity: they look more like the kinds
of tasks we want children to be able to do. Second,
these item types may do a better job of reflecting the
complexity of knowledge, because they can allow
partial credit for partial understanding. Third, these
item types may enhance the reliability and validity
of scores because they eliminate guessing and other



¹¹Vermont did not require statewide testing prior to 1990. The introduction of performance assessment through portfolios in mathematics and writing
is the first mandated statewide testing.

¹²See ch. 6 for further discussion of norm-referenced testing.

¹³A group of researchers at the Educational Testing Service has attempted to describe a framework for categorizing some of these item types. These
researchers have ordered a number of such item types along an "openness" continuum that includes selection/identification, reordering/rearrangement,
substitution/correction, completion, and construction. See Randy E. Bennett, William C. Ward, Donald A. Rock, and Colleen LaHart, "Toward a
Framework for Constructed-Response Items," ETS research report RR 90-7, 1990.

¹⁴Ibid.; and James Braswell and J. Kupin, "Item Formats in Mathematics," Construction Versus Choice in Cognitive Measurement, R.E. Bennett
and W.C. Ward (eds.) (Hillsdale, NJ: L. Erlbaum Associates, in press).

Box 7-A—The Arizona Student Assessment Program¹

Arizona revised its curriculum substantially and then discovered that existing State-mandated tests were no
longer appropriate. Teachers carried a heavy annual testing burden, but remained unsure how the various tests
corresponded to what they were expected to teach. Describing the old State-mandated testing, required in grades 1
through 12 every spring, using the Iowa Tests of Basic Skills (ITBS), Tests of Achievement and Proficiency (TAP),
and district testing under the Continuous Uniform Evaluation System (CUES), one teacher expressed frustration:

We have these CUES tests, pre- and post-test . . . In one grade we have 135 little skills tests in all of those forms,
pre- and post-test. We teach what we think is important to teach . . . until right before our CUES tests. Then we teach
students how to do well on the CUES tests. We also give the Iowa Tests of Basic Skills and it takes about a week.
We teach what we think is important all year long . . . until right before the ITBS. Then we teach students how to take
the ITBS. . . . We get the scores back on the ITBS right before students leave for the summer, and I usually have to
follow students out the door on the last day with a stapler in one hand and the test scores in the other so I can staple
the score reports onto their report cards. We have an entirely different group of students over the next year so that
it doesn't do much good to analyze the test scores over the summer. . . . I feel confused. What are we supposed to
teach? What is valued? It seems to me we are spending a great deal of time getting ready for two measures that are
at odds with what we have agreed in my district is important to teach.²

Statewide curriculum frameworks, known as Essential Skills Documents (ESDs), were developed starting in
1986, to outline broad competencies and goals at the elementary, middle, and high school levels across the State.³
Most teachers enthusiastically embraced the documents but some lamented: "That's the way I'd like to teach . . .
if it weren't for the way we test."⁴ Reflecting this concern, the State legislature set up a joint committee in 1987
to review the overall teaching and assessment program in the State, looking especially to see if the skills and
processes identified in the Essential Skills curriculum frameworks were being successfully acquired by Arizona
students.

An independent committee analyzed whether the skills required in the ESDs were being assessed in the ITBS
and TAP. Results for mathematics, reading, and writing indicated that only 20 to 40 percent (with an average of
26 percent) of the Essential Skills were assessed by the ITBS and TAP. Thus, even with annual testing for all grades,
Arizona was only receiving information on how well students were mastering one-quarter of the content of the new
curriculum. As one teacher said:

The teacher in Arizona can't serve two masters. If they want the teachers to do a good job of teaching math
they can use the Essential Skills Documents . . . and throw out the ITBS tests, or teach the ITBS tests and throw out
the Essential Skills Documents.⁵

With the support of teachers, school boards, administrators, and the business community, the legislature passed
State Law 1442 by a landslide. The act required the Arizona Department of Education to create an assessment plan
that would do a better job of testing the Essential Skills. Thus the Arizona Student Assessment Program (ASAP)
was born in the spring of 1990, setting a new approach to State testing.

ASAP is an umbrella program composed of new performance measures, continuing but reduced emphasis on
norm-referenced testing, and extensive school, district, and State report cards. Riverside Publishing Co., the same
company that produces the TAP and ITBS, was selected to produce the new assessments at the benchmark grades
of 3, 8, and 12 in each of the three subject areas. To best match the goals of the ESDs, the new tests were to be
performance- and curriculum-based assessments. The language arts assessment is an interesting example.
Paralleling the way writing is taught under the language arts framework, the assessment is a two-step process. On
the first day of testing, students engage in the steps that make up the "prewriting" process (e.g., brainstorming,
listing, mapping, or "webbing" ideas) and create a first draft; on the second day of testing, they reread the draft,



¹Much of this discussion is taken from Ruth Mitchell and Amy Stempel, "Six Case Studies of Performance
Assessment," OTA contractor report, March 1991.

²Lois Brown Easton, "Developing Educational Performance Tests [...]
L. Finch (ed.) (Chicago, IL: Riverside Publishing Co., 1991), p. 47.

³The language arts framework was published in 1986 and the mathematics framework in 1987; by the end of 1990, Essential Skills
Documents were available in 12 subjects including, in addition to the above, frameworks in science, health, social studies, and the arts. Mitchell
and Stempel, op. cit., footnote 1.

⁴Easton, op. cit., footnote 2.

⁵Arizona Department of Education, Arizona Essential Skills for Mathematics (Phoenix, AZ: July 1987), p. i.


revise, and write a second draft. Similar performance-based assessments have been created for mathematics and
reading, with science and social studies assessments [...]

The first official assessment will be implemented in March 1992 and scored by teachers at regional scoring
sites, none more than an hour's drive from any district. [...]
and will receive a small stipend and graduate credit for their work. [...]
be reliable between readers as well as consistent when a reader was [...]

Scoring also took less time than expected.⁶ Having the classroom teachers score the examinations is seen as a
positive staff development activity, as teachers become involved in setting common [...]
the review process with their colleagues from around the State.

Norm-referenced tests (NRTs) are being continued as a way to compare Arizona's student achievement against
a national testing reference. However, their influence is being reduced [...]
and TAP each year (i.e., subtests, rather than the full test battery), reducing test-taking time [...]
two-thirds.⁷ The norm-referenced testing will [...]
derived from spring testing had been considered a reflection of what the teachers taught [...]
if the test content did not always correspond to what was actually taught. Teachers often felt pressured to spend
considerable time preparing students for the spring tests. With fall testing, [...]
the tests with more equanimity, and there will be less pressure to "prep" students. [...]
will be returned in time to be used for that year's instructional planning.

The third component of ASAP changes the way school and district achievement will be reported [...]
each July things got "hot" in Arizona, as newspaper stories listed every school in a district along [...]
scores on the TAP and ITBS. Little interpretative information was provided [...]
higher the score, the better the school. The new reporting system will try to paint a more realistic picture of
achievement at the school, district, and State level. These annual "Arizona Report Cards" [...]
Skills scores, NRT scores, and other [...] achievement (e.g., [...]
science fair winners, and special award winners). However, to set these in context, factors that affect achievement
are also reported, such as student socioeconomic status, mobility rate, percentage of students with limited English
proficiency, and faculty turnover rates. Although it is assumed that school and district [...]
to be made, it is hoped that these comparisons will be made on a more meaningful and realistic [...]

When the new program was introduced to teams of 850 teachers from across the State at a [...] conference
in October 1990, teacher reaction was mixed. Although many were pleased with the new approach, they were
concerned with the difficulty of putting the new system into place. As one said: "The [...]
incredible. We need staff development on pedagogy, on writing, on logic, everything. To do this in the timeframe
we have, we need big bucks."

Assessment costs are difficult to determine because the change in assessment is aligned to changes across the
system, especially curriculum development and professional development. Money saved from less ITBS and TAP
testing will be used for all three parts of the ASAP in coming years: the NRTs, performance assessments, and
nontest indicators. Nevertheless, costs for the program (the request for proposal for developing the new
performance-based assessments, the statewide teacher conference, preparing teacher scorers, and training all
teachers in the new system) will be substantial. While perhaps an expensive gamble, the State commitment to move
forward indicates the priority Arizona legislators and educators have placed on introducing a new approach to
assessment throughout the State.



⁶Easton, op. cit., footnote 2, p. 56.
⁷Ibid., p. 57.



"back door" approaches, such as strategies of 
elimination or getting cues from incorrect choices. 
Fourth, some of these items can use scoring methods 
that recognize the correctness of a variety of 
different answers, representing the complexity of 
understanding and knowledge. (This suggests the
potential diagnostic value of CR items. These items
can reveal the processes used by the learner; e.g., a
scorer can examine the student's problem-solving
steps and detect errors in reasoning or misconceptions.)
And, finally, one of the most often cited (but
least documented) assumptions is that these items

Box 7-B—Kentucky's Performance Assessments and Valued Outcomes¹

Kentucky is fundamentally redesigning its State educational system. When the 1990 Kentucky Education
Reform Act is fully implemented, the State will have the first system that measures student achievement entirely
by performance-based testing. It will also be [...]
and punished based on test results.²

In rethinking basic educational practices and premises, Kentucky educators hope to give classroom teachers
a larger voice and improved ability to report on what they believe a student [...] hope to move away
from the common model that values the results of State-administered norm-referenced tests more highly than
classroom-based testing and teachers' grade cards. The goal is to integrate [...]
invisible to the student, minimizing the use of external instruments as much as possible. The Kentucky approach
will require extensive training of teachers as well as a backup system to ensure quality control.

Under the guidance of a Council on School Performance Standards, 11 task forces involving some 1,000
educators are working to identify the activities needed to define expected student outcomes and set the level of
proficiency desired at three "anchor points": the 4th, 8th, and 12th grades. Teachers will continually evaluate
students on a less formal basis in the interim grades to be sure progress is being made by all students as [...]
for the benchmark performance levels. Additionally, as younger children watch the performance of older peers, they
will be encouraged to model themselves on the older students and see how close they are to that level of [...]
This approach is based on a sports metaphor, with the students participating in "scrimmages" that [...]
tests at earlier grade levels. Younger students are similar to the "junior varsity" as [...]
learn from watching the "varsity," older students at higher levels of performance.

Benchmark grades will be tested each year but reported every other year for accountability purposes.
Successful schools will receive monetary rewards from the State; unsuccessful schools will be required [...]
plans for improvement. If a school is particularly unsuccessful, it may be declared a "school in crisis" and its
students may be permitted to transfer to more successful schools, or administrators may be replaced and
"distinguished educators" may be brought in to help.³

In the summer of 1991, a contractor was selected to create the 1993-96 performance assessments in [...]
arts, science and technology, mathematics, social studies, arts and humanities, practical living, and vocational
studies. Development costs over the first 18 months are estimated to be approximately $3.3 million. An interim
testing program administered to a sample of students during the 1991-92 school year will provide [...]
school success during 1993-94. The interim test has been controversial because of its traditional nature; some [...] that
it could sidetrack implementation of the full program of performance-based measures.



¹Much of this box is taken from Kentucky Department [...]
Student Assessment Program for [...]
communication, June 1991.

²"Update," Education Week, vol. 10, No. 40, July 31, 1991, p. 33.

³Mary Helen Miller, Kevin Noland, and John Schaaf, A Guide to the Kentucky Education Reform Act of 1990 (Frankfort, KY: Legislative
Research Commission, April 1990), p. 3.



tap more sophisticated reasoning and thinking proc- 
esses than do multiple-choice items. 

California has been a pioneer in the effort to use
open-ended CR items. In 1987-88, the State piloted
a number of open-ended mathematics problems as
part of the 12th grade State test. Some of the
questions were intentionally structured to be broad
to allow ". . . students to respond creatively, demonstrate
the full extent of their mathematical understanding,
and display the elegance and originality of
their thought processes."¹⁵ One such question,
along with representative answers, is pictured in
figure 7-3. As the sample answers suggest, some
students demonstrated a high degree of competence
in mathematical reasoning while others displayed
misconceptions or lack of mathematical understanding.
Sixty-five percent of the answers to this
question were judged to be inadequate, leading the
developers to surmise that: ". . . the inadequate



¹⁵California State Department of Education, "A Question of Thinking: A First Look at Students' Performance on Open-Ended Questions in
Mathematics," unpublished report, 1989, p. 3.

Box 7-C—The California Assessment Program: Testing That Goes "Beyond the Bubble"¹

The California Assessment Program (CAP) was created in 1974-75 as part of an early school reform program.
It has evolved over the years to reflect changes in curricula, student population, and pressures for accountability,
but CAP continues to be seen as a model for other States, primarily due to two factors: the State carefully defined
curricular objectives as the starting point for assessment, and devoted considerable research and support to the
development of new forms of assessment.

Bringing education reform to a State as large as California, larger in population than many European countries,
has been a monumental task. The main vehicle for change has come with the creation of statewide curriculum
frameworks, documents developed starting in 1983 in response to a major school reform bill. These curriculum
guidelines and frameworks have been modified over time and now center on developing students' ability to think,
to apply concepts, and to take responsibility for their own learning. The frameworks mandate a curriculum that is
". . . literature-based, value-laden, culturally rich, and integrated across content areas."² Writing across the
curriculum, cooperative learning, experiential learning, and problem solving are emphasized. Although the
frameworks are not mandated, they are the basis for the mandated CAP assessments, creating indirect pressure on
districts to align the curriculum and instruction.

It became clear that much of what was to be taught with the new frameworks would not be taught or assessed
appropriately if student achievement was evaluated with existing multiple-choice tests. A shift to performance
assessment was sought to bring curriculum and instruction in line with the frameworks. The first performance
assessment component, a direct writing assessment, was developed by teachers and put into place in 1987. Each
year several hundred teachers gather over a 4- to 6-day period at four sites across the State to score the essays.
Teacher scoring is emphasized to enhance the connection between instruction and assessment.

The success of the effort seems to validate this connection and meet expectations. One report suggests that:
". . . educators throughout California have expressed the belief that no single program has ever had statewide
impact on instruction equal to that of the writing assessment."³ A study at the completion of the first year of the
writing assessment found that 78 percent of the teachers surveyed reported they assigned more writing, and almost
all (94 percent) assigned a greater variety of writing tasks.⁴ The percentage of students who reported that they wrote
[...] or more papers in a 6-week period jumped from 22 to 33 percent. The writing assessment has also motivated
a huge increase in staff development, with the California Writing Project training over 10,000 teachers in support
of improved instruction in writing.⁵

In December 1989, California held an Education Summit, in response to the National Education Summit of
the Nation's Governors in Charlottesville, Virginia. In seeking areas most likely to produce significant change
("targets of opportunity") and building on the strengths of the California system, the educators called for statewide
performance goals that would be measured through a strengthened assessment system. The report stated:

The fundamental objectives of educational testing in California schools are far from fulfilled. The dominant testing
methods and formats not only fail to support the kind of teaching and learning that the state and national curriculum
reform movement calls for, but actually retard that movement in California. Students, teachers, and parents are not
getting the necessary information to gauge the educational system's progress, detect strengths and weaknesses,
improve instruction, and judge overall effectiveness. The current approach to assessment of student achievement
which relies on multiple choice student response must be abandoned because of its deleterious effect on the
educational process. An assessment system which measures student achievement on performance-based measures is
essential for driving the needed reform toward a thinking curriculum in which students are actively engaged and
successful in achieving goals in and beyond high school.⁶



¹Ruth Mitchell and Amy Stempel, Council for Basic Education, "Six Case Studies of Performance Assessment," OTA contractor report,
March 1991.

²North Central Regional Educational Laboratory and Public Broadcasting Service, "Multidimensional Assessment: Strategies for
Schools," Video Conference 4, 1990, p. 27.

³California Assessment Program, "California: The State of Assessment," draft report, Apr. 3, 1990, p. 8.

⁴An evaluation of the grade eight writing assessment by the National Center for the Study of Writing at the University of California,
Berkeley, cited in ibid.

⁵Ibid., p. 8.

⁶California Department of Education, California Education Summit: Meeting the Challenge, the Schools Respond, final report
(Sacramento, CA: February 1990).






The direct writing assessment was cited as an example of the kind of assessment needed to drive program
improvements. The summit thus gave support and further stimulus for continuing research and piloting of new
methods.

In the past, statewide testing used matrix sampling, in which each student takes only a portion of the test and
scores are reported on the school or district level, but not for individual students. However, recent legislation⁷
mandates that beginning in 1992-93 individual testing will be conducted statewide in grades 4, 5, 8, and 10 in basic
skills and content courses. The use of direct writing assessment and other performance-based assessments is
encouraged. Districts can also choose their own student tests at other [...]
California curriculum frameworks, with reporting based on common performance standards. The new program
gives special emphasis to end-of-course examinations for secondary school subjects. These will be based on the
existing Golden State Examinations, which students now take on a voluntary basis at the completion of Algebra,
Geometry, Biology, Chemistry, U.S. History, and Economics. Districts may require that all students take one or
more Golden State Examinations. Finally, the integrated student assessment system will also include a portfolio for
all students graduating from high school. The portfolio will contain documentation of performance standards
attained on the grade 10 test (or other forms of the test taken in grades 11 and 12), on end-of-course Golden State
Examinations, and on vocational certification examinations, as well as evidence of job experience and other valued
accomplishments.⁸

This represents a big jump in required testing. Performance-based components are defined as building blocks
for all the tests, both CAP and district-administered. CAP has indirectly influenced the testing done at the district
level by ". . . opening the door . . . giving permission to go ahead with performance assessment."⁹ CAP also has
pilot projects for portfolios in writing and mathematics, and research studying the impact on instruction of
open-ended mathematics questions.

Developing performance-based assessments is not a simple task. At the 1987 "Beyond the Bubble"
conference on testing, educators grappled with the issue of developing new ways to produce alternative assessments
that more directly reflect student performance. A suggestion to support grassroots efforts by teachers with assistance
from assessment experts eventually led to the Alternative Assessment Pilot Project. In 1991, the Governor
authorized $1 million to implement its provisions, and two consortia of California school districts (one in the north
and one in the south of the State) have been given grants totaling over $965,000 to begin the project. Each
consortium will develop, field test, and disseminate alternatives to standardized multiple-choice tests for assessment
of student achievement. At the school level, teachers will develop their own materials and strategies and pilot them
with their own classrooms and schools, sharing information with other teachers across the State. A cost-benefit
analysis of the local use of current performance-based assessment systems will also be conducted.¹⁰

Because of the scope of these endeavors, many other States are looking to the California experiment as a guide
to their own efforts to realign testing and curriculum.

⁷Chapter 760, California Statutes of 1991 (SB 662; Hart).

⁸Superintendent Honig, California State Department of Education, "New Integrated Assessment System," testimony before the State
Assembly Education Committee, background information, Aug. 21, 1991.

⁹Ruben Carriedo, director of Planning, Research and Evaluation Division, San Diego City Schools, cited in Mitchell and Stempel, op.
cit., footnote 1, p. 17.

¹⁰California Department of Education News Release, Aug. 2, 1991.



responses of a large number of students occurred
primarily because students are not accustomed to
writing about mathematics."¹⁶

The National Assessment of Educational Progress
(NAEP) has also successfully used a variety of
open-ended items. In the 1990 NAEP mathematics
assessment, about one-third of the items included
open-ended questions that required students to use
calculators, produce the solution to a question, or
explain their answers. The 1990 reading test, which
also employed text passages drawn from primary
sources, including literary text, informational text,
and documents, used a number of short essays to
assess the student's ability to construct meaning and
provide interpretations of text. The 1985-86 NAEP
assessment of computer competence included some



¹⁶Ibid., p. 6.

Figure 7-3—Open-Ended Mathematics Item With Sample Student Answers

QUESTION: James knows that half of the students from his school are accepted at the public university nearby. Also, half are accepted
at the local private college. James thinks that this adds up to 100 percent, so he will surely be accepted at one or the other institution. Explain
why James may be wrong. If possible, use a diagram in your explanation.

Good Mathematical Reasoning: Sample Answers

[Handwritten sample answers not legible in this copy.]

Misconceptions: Sample Answers

[Handwritten sample answers not legible in this copy.]

NOTE: Used in the 1987-88 version of the 12th grade California Assessment Program test, this logic problem assesses a student's ability to detect and explain
faulty reasoning. Answers are scored on a 0 to 6 point scale. The student must give a clear and mathematically correct explanation of the faulty
reasoning. For the highest score, responses must be complete, contain examples and/or counter examples of overlapping sets, or have elegantly
expressed mathematics. A diagram is expected.

SOURCE: California State Department of Education, A Question of Thinking: A First Look at Students' Performance on Open-Ended Questions in
Mathematics (Sacramento, CA: 1989), pp. 21-28.

open-ended items asking students to write short
computer programs or indicate how the "turtle"
would move in response to a set of computer
commands; students were given partial credit for
elements of a correct response. 
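
A partial-credit rule of this kind can be illustrated with a short Python sketch. It is not the NAEP scoring procedure, which is not described here: the command names (FORWARD, BACK, RIGHT, LEFT), the starting position and heading, and the two-point rubric (one point for the turtle's final position, one for its final heading) are all assumptions made for illustration.

```python
import math

def run_turtle(commands):
    """Simulate (name, amount) commands; return the final (x, y, heading)."""
    # Assumed start: at the origin, facing "up" (90 degrees).
    x, y, heading = 0.0, 0.0, 90.0
    for name, amount in commands:
        if name in ("FORWARD", "BACK"):
            step = amount if name == "FORWARD" else -amount
            x += step * math.cos(math.radians(heading))
            y += step * math.sin(math.radians(heading))
        elif name == "RIGHT":
            heading = (heading - amount) % 360
        elif name == "LEFT":
            heading = (heading + amount) % 360
        else:
            raise ValueError(f"unknown command: {name}")
    return x, y, heading

def partial_credit(predicted, commands, tol=1e-6):
    """Hypothetical rubric: 1 point for final position, 1 for final heading."""
    x, y, heading = run_turtle(commands)
    px, py, pheading = predicted
    points = 0
    if math.hypot(px - x, py - y) <= tol:
        points += 1
    # Compare headings as angles, so that 0 and 360 count as the same.
    if abs((pheading - heading + 180) % 360 - 180) <= tol:
        points += 1
    return points

# A student predicts where the turtle ends up after three commands.
commands = [("FORWARD", 10), ("RIGHT", 90), ("FORWARD", 5)]
full = partial_credit((5, 10, 0), commands)
half = partial_credit((5, 10, 90), commands)
```

Under these assumptions, a student who predicts the correct final position but the wrong heading earns one of the two points rather than none.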

Scoring: Machines and Judges 

Researchers and test developers are now considering
ways to streamline available methods for scoring
the more open-ended CR items. One promising area
involves new types of CR items that can be entered
on paper-and-pencil answer sheets and scanned by
machines.¹⁷ One such item type for mathematics
problems is the grid-in format. Students solve the
problem, write their solution at the top of a grid, and
then fill in a bubble corresponding to each number
in the column under that number (see figure 7-4). 
Questions that have more than one correct answer 
are possible, and the format allows for the possibility 
of answering in either fractions, decimals, or inte- 
gers. 
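
The grid-in scoring rule described above can be sketched in a few lines of Python. This is an illustration only, not the scoring software of any testing program: the function names parse_grid_entry and score_grid_in are hypothetical, and the sketch assumes the scanner has already converted the filled-in bubbles to a string such as "18", "4.5", or "9/2". Exact rational arithmetic lets fraction, decimal, and integer forms of the same value be treated as equal, and a list of accepted values allows for items with more than one correct answer.

```python
from fractions import Fraction
from typing import Iterable

def parse_grid_entry(entry: str) -> Fraction:
    """Convert a gridded string ('18', '4.5', '9/2') to an exact Fraction."""
    return Fraction(entry.strip())

def score_grid_in(entry: str, accepted: Iterable[Fraction]) -> bool:
    """Return True if the gridded value equals any accepted answer."""
    try:
        value = parse_grid_entry(entry)
    except (ValueError, ZeroDivisionError):
        return False  # blank, unreadable, or malformed grid
    return value in set(accepted)

# Theater item from figure 7-4: Section I has 12 rows of 15 seats;
# Section II holds the same total number of seats in 10 equal rows.
seats_per_row = Fraction(12 * 15, 10)
accepted_answers = [seats_per_row]
```

For the theater item, the accepted value is 12 x 15 / 10 = 18, so entries gridded as 18, 18.0, or 36/2 would all be scored as correct under this sketch, while a blank or malformed grid would be scored as incorrect.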

"Figural response" items, which require drawing
in a response on a graph, illustration, or diagram,
were field tested in the 1989-90 NAEP science
assessment (see figure 7-5). The feasibility of
machine scoring of these items was also tested by
using high-resolution image processors to score the
penciled-in answers. Some initial technological
difficulties were encountered with the scanning
process: many student answers were too light to be
read and the ink created some interference. However,
the researchers express optimism that the
scanning mechanism can be made to work.¹⁹

Researchers are working on technologies of 
handwriting recognition that will eventually result in 
printed letters and numbers that can be machine 
scanned from answer sheets, but these technologies 



Figure 7-4—Machine-Scorable Formats: Grid-in and
Multiple-Choice Versions of a Mathematics Item

The Question: 

Section I of a certain theater contains 12 rows of 15 seats each. 
Section II contains 10 rows, but has the same total number of seats as 
Section I. If each row in Section II contains the same number of seats, 
how many seats are in each row? 



Test 1, Multiple-Choice Version

Test 2, Grid Version

NOTE: This item was designed for high school juniors and seniors.

SOURCE: Educational Testing Service, Policy Information Center, ETS
Policy Notes, vol. 2, No. 3, August 1990, p. 5.

are still far from reliable except under optimal
conditions: the letters must be cleanly printed and
properly aligned. Systems that can read cursive
handwriting are in a more experimental stage;
whether the ". . . scrawl likely to be produced under
the pressure of examinations . . ." could ever be read
by a computer is questionable.²⁰

CR items vary considerably in the extent to which 
they can be scored objectively. More objective items 
will have scoring rules that are very clear and 
involve little or no judgment. Other responses, such
as short written descriptions or writing the steps to
a geometry proof, are more complicated to score, in
part because there are multiple possibilities for



17 Many of the problems involved in machine scanning are solved if constructed-response items can be delivered via computer. If the students take
a mathematics computation test via computer, they can simply type in the correct numbers; a short essay can be written on the keyboard. As a result,
the computer is in many ways a more "friendly" system for the delivery of many constructed-response type items, because problems related to scanning
in the answer are solved. The machine-scanning problem is much less tractable for items delivered via paper-and-pencil tests. See ch. 8 for further
discussion of the issues involved in administering tests via computers.

18 James Braswell, "An Alternative to Multiple-Choice Testing in Mathematics for Large-Volume Examination Programs," paper presented at the
annual meeting of the American Educational Research Association, Boston, MA, April 1990. Grid-in items for mathematics are currently under
development for both the SAT and the ACT college admissions examinations. Preliminary results with college-bound students are encouraging:

Guessing and back door approaches to solving mathematics questions are virtually eliminated and the range of answers that students offer to
individual questions is great and frequently does not match well with the distractors provided in multiple-choice versions of the same items. As
one would expect, the grid-in format requires more time. (p. 1)

19 Michael Martinez, John J. Ferris, William Kraft, and Winton H. Manning, "Automated Scoring of Paper-and-Pencil Figural Responses," ETS
research report RR-90-23, October 1990.

20 Leslie Kitchen, "What Computers Can See: A Sketch of Accomplishments in Computer Vision, With Speculations on Its Use in Educational
Testing," Artificial Intelligence and the Future of Testing, Roy Freedle (ed.) (Hillsdale, NJ: L. Erlbaum Associates, 1990), p. 134.




Figure 7-5. Figural Response Item Used in 1990 NAEP Science Assessment

The map below shows a high-pressure area centered over North Dakota and a low-pressure area centered over Massachusetts.
Draw an arrow (->) over Lake Michigan that shows the direction in which the winds will blow.

[Map with labels: H (high pressure); Lake Michigan]

KEY: NAEP = National Assessment of Educational Progress.
NOTE: This item was used with 8th and 12th graders.

SOURCE: Michael E. Martinez, "A Comparison of Multiple-Choice and Constructed Figural Response Items," paper presented at the annual meeting of the
American Educational Research Association, Boston, MA, April 1990.



correct or partially correct answers. Machine scoring
of even more complex products, such as the steps in
the solution of algebra word problems or computer
programming, proves to be much more complicated;
preliminary work drawing on artificial intelligence
research suggests that automated scoring can
eventually be developed. However, the time and cost
required to develop such a program are very high. "In
both instances, the underlying scoring mechanism is
an expert system, a computer program that emulates
one or more aspects of the behavior of a master
judge."21

One of the more difficult and long-term problems
of developing artificial intelligence models to score
constructed responses is building their capacity for
error detection. Programming machines to recognize
correct answers is far easier than programming them
to detect errors, grade partial solutions, and provide
evaluation of error patterns.22 When questions that
allow for more than one right answer are used,
programming of the scoring can get quite
complicated.23 Yet one of the highly desirable features of
CR items is their potential for diagnosis of
misconceptions, errors, and incorrect strategies.24

Although most CR items still require human
scoring, procedures exist that can eliminate error and
make this scoring more reliable. Development of
clear standards for judging student answers and
intensive training of judges until they reach
acceptable levels of agreement are important components
of establishing high inter-rater reliability (see
discussion in ch. 6). Preliminary indications are that
most CR items can be scored with inter-rater
reliability equal to or better than that achieved by
judges grading essays. The process of training
judges to grade essays reliably has been successfully
developed in some large-scale testing programs; in
addition, many commercial publishers and other
companies now offer grading services
to schools that want independent and technically
supervised rating procedures.

21 Randy Bennett, "Toward Intelligent Assessment: An Integration of Constructed Response Testing, Artificial Intelligence, and Model-Based
Measurement," ETS research report RR-90-5, 1990, p. 5. For a description of artificial intelligence applied to a constructed-response computer
programming problem, see Henry I. Braun, Randy E. Bennett, Douglas Frye, and Elliot Soloway, "Scoring Constructed Responses Using Expert
Systems," Journal of Educational Measurement, vol. 27, No. 2, summer 1990, pp. 93-108.

22 Roy Freedle, "Artificial Intelligence and Its Implications for the Future of ETS's Tests," in Freedle (ed.), op. cit., footnote 20.

23 Braswell and Kupin, op. cit., footnote 14.

24 See Menucha Birenbaum and Kikumi Tatsuoka, "Open-Ended Versus Multiple-Choice Response Formats: It Does Make a Difference for
Diagnostic Purposes," Applied Psychological Measurement, vol. 11, No. 4, 1987, pp. 385-395.

The feasibility of scoring geometry proofs on a
large scale has recently been demonstrated by the
State of North Carolina. Because an important
objective of the high school geometry curriculum in
North Carolina was for students to learn to develop
complete proofs, the State assessment program
included such proofs in the new assessment. All
43,000 geometry students in the State were given
two geometry proof questions in the spring of 1989.
Over 400 teachers from throughout the State were
trained to score the proofs. Drawing on the lessons
from the scoring of writing assessments (e.g., the
importance of developing scoring criteria and
training), high levels of scorer agreement were achieved.
Actual time devoted to training was less than 3
hours.25
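
Scorer agreement of the kind described above is commonly summarized with simple statistics. The Python sketch below computes two of them for a pair of raters: the proportion of papers given identical scores, and Cohen's kappa, which discounts the agreement expected by chance. The report does not say which statistic North Carolina used; the function names and the sample scores are invented for illustration.

```python
from collections import Counter


def exact_agreement(rater_a, rater_b):
    """Proportion of papers to which both raters gave the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of papers")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    observed = exact_agreement(rater_a, rater_b)
    n = len(rater_a)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters pick the same category
    # if each assigned scores independently at their own observed rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single, identical category
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Invented scores on a 1-4 scale for six papers.
    first_reader = [3, 2, 4, 4, 1, 3]
    second_reader = [3, 2, 4, 3, 1, 3]
    print(f"exact agreement: {exact_agreement(first_reader, second_reader):.2f}")
    print(f"Cohen's kappa:   {cohens_kappa(first_reader, second_reader):.2f}")
```

Kappa is lower than raw agreement whenever some matches would be expected by chance alone, which is why training programs that report only percent agreement can look more reliable than they are.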

Constructed-Response Items as 
Diagnostic Tools 

One of the features of CR items that makes them
attractive to educators is that they allow closer
examination of learners' thinking processes. When
students write out the steps taken in solving a proof,
or a list of how they reached their conclusions, the
students' thinking processes can be examined and
scored. Results of one study have suggested that
CR-type items may be more effective than
multiple-choice items for diagnostic purposes; i.e., for
uncovering the processes of learners in ways that might
help a teacher better understand students' errors or
misconceptions.26

Not only might errors and misconceptions be
more readily uncovered, but students' abilities to
generate and construct meaning in complex tasks
can also be assessed. The methods for developing
these more complex scoring systems are not yet well
established or understood. Cognitive research
methods (see ch. 2) are beginning to be applied to the
development of scoring rubrics for CR-type items.
"Think aloud" methods, where children are closely
observed and interviewed while solving open-ended
problems, can provide a rich source of information
to help build scoring rubrics. Early efforts to
generate scoring criteria based on comparing the
performance of experts and novices also have been
encouraging.27 One of the challenges for researchers
in this area is to develop scoring criteria that have
general utility across a number of tasks, instead of
being specific to a particular test question or essay
prompt.28

Although the relative virtues of multiple-choice
and CR items have been debated in the educational
literature since early in this century, there are few
comprehensive empirical studies on the topic. Thus,
although there is considerable "textbook" lore
about the differences between the two types of items,
few generalizations can be made with confidence
about differences in student performance.29 CR
items have not been widely field tested in large-scale
testing programs. Very few researchers have
collected data that allow direct comparison of CR with
multiple-choice items.

It is fair to say that no one has yet conclusively
demonstrated that CR items measure more "higher
order" thinking skills than do multiple-choice
items. "All the same, there are often sound
educational reasons for employing the less efficient
format, as some large-scale testing programs, such
as AP [Advanced Placement], have chosen to do."30

25 Zollie Stevenson, Jr., Chris P. Averett, and Daisy Vickers, "The Reliability of Using a Focused-Holistic Scoring Approach to Measure Student
Performance on a Geometry Proof," paper presented at the annual meeting of the ...

26 Birenbaum and Tatsuoka, op. cit., footnote 24.

27 See, for example, Kevin Collis and Thomas A. Romberg, "Assessment of Mathematical Performance: An Analysis of Open-Ended Test Items,"
and Eva L. Baker, Marie Freeman, and Serena Clayton, "Cognitive Assessment of History for Large-Scale Testing," Testing and Cognition, Merlin
C. Wittrock and Eva L. Baker (eds.) (Englewood Cliffs, NJ: Prentice Hall, 1991).

28 Baker et al., op. cit., footnote 27.

29 See R.E. Traub and K. MacRury, "Multiple-Choice vs. Free-Response in the Testing of Scholastic Achievement," Tests and Trends 8: Jahrbuch
der Padagogischen Diagnostik, K. Ingenkamp and R.S. Jager (eds.) (Weinheim and Basel, Germany: Beltz Verlag, 1990), pp. 128-159; Ross Traub,
"On the Equivalence of the Traits Assessed by Multiple-Choice and Constructed-Response Tests," in Bennett and Ward (eds.), op. cit., footnote 14;
and Thomas P. Hogan, "Relationship Between Free-Response and Choice-Type Tests of Achievement: A Review of the Literature," ERIC document
ED 224 811 (Green Bay, WI: University of Wisconsin, 1991).

Essays and Writing Assessment 

Essays, particularly when used to assess writing
proficiency, are the most common form of
performance assessment. In fact, the noun "essay" is
defined as "trial, test" and the verb as ". . . to make
an often tentative or experimental effort to
perform."31 Essays are a relatively well understood
testing format, in part because they have been used
for many years. An essay is an excellent example of
performance assessment when used to assess
students' ability to write. Essay questions for assessing
content mastery are also a form of performance
assessment, because they require student-created
products that demonstrate understanding. The
problem arises in scoring subject matter essays: is
students' understanding of content being masked by
a difficulty in written expression? In that case,
writing skill can confound scoring for content
knowledge.

Essays as Assessments of Content Mastery 

Student understanding of a subject has long been 
assessed by requiring the student to write an essay 
that uses facts in context. Essay questions have been 
central to some large-scale testing programs
overseas (see ch. 5); they also make up approximately 60
percent of the questions on the Advanced Placement
examinations administered by the College Board.
The essay to show content mastery is in fact the
hallmark of classical education; student writing
about a subject reveals how fully the student has
grasped not only the obvious information but the 
relationships, subtleties, and implications of the 
topic. The use of writing as an instructional and 
testing device is familiar to scholars, and its use by 
all students is increasingly understood to help 
develop thinking skills as well as communications 
skills. 



Students have different expectations about
different types of tests. For example, one study found that
students report a preference for multiple-choice over
essay tests ". . . on the grounds that these tests are
easier to prepare for, are easier to take, and hold forth
hope for higher relative scores."32 Other studies
have suggested that students study differently for
essay tests than they do for multiple-choice tests. For
example, one study found that students ". . .
consider open questions a more demanding test than a
multiple-choice test . . ." and use more study time to
prepare for it.33 However, almost no data exist about
what students actually do differently when studying
for different kinds of tests, and evidence is
ambiguous regarding whether these different study
strategies affect actual achievement.34

Essays as Tests of Writing Skill 

Many large-scale testing programs have begun the 
move toward performance assessment by adding a 
direct writing sample to their tests. One reason for 
this shift is a concern that the wrong message is sent 
to students and teachers when writing is not directly 
tested. According to one researcher of writing 
ability: 

A test that requires actual writing is sending a
clear message to the students, teachers, parents, and
the general public that writing should be taught and
tested by having students write. Although it may be
that a test that includes a writing sample will gain
little in psychometric terms over an
all-multiple-choice test, the educational gains may be enormous.
The English Composition Test, administered as part
of the College Board Achievement Tests, contains
one 20-minute essay section in the December
administration only. At that administration
approximately 85,000 students write in response to a set
topic, and each of the 85,000 papers must be scored
twice. That scoring may cost in the neighborhood of
$500,000. The increase in predictive validity for the
test is minimal. Admissions officers and others who
use the scores are probably not seeing a dramatic
increase in the usefulness of scores despite the
expenditure of the half million dollars. However,
thousands of English teachers in the United States
consider the money well spent. The political clout
that a writing sample provides for teaching writing
and for emphasizing writing across the curriculum
has no monetary equivalent.35

30 Randy E. Bennett, Donald A. Rock, and Minhwei Wang, "Free-Response and Multiple-Choice Items: Measures of the Same Ability?" ETS research
report RR-90-8, 1990, p. 19.

31 Webster's Ninth New Collegiate Dictionary (Springfield, MA: Merriam-Webster, Inc., 1983), p. 425.

32 Traub and MacRury, op. cit., footnote 29, p. 42.

33 Gery D'Ydewalle, Anne Swerts, and Erik De Corte, "Study Time and Test Performance as a Function of Test Expectations," Contemporary
Educational Psychology, vol. 8, January 1983, p. 55. See also Gordon Warren, "Essay Versus Multiple Choice Tests," Journal of Research in Science
Teaching, vol. 16, No. 6, January 1979, pp. 563-567.

34 Mary A. Lundeberg and Paul W. Fox, "Do Laboratory Findings on Test Expectancy Generalize to Classroom Outcomes?" Review of Educational
Research, vol. 61, No. 1, spring 1991, pp. 94-106; and Traub and MacRury, op. cit., footnote 29.
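
The figures in the quoted passage on the English Composition Test imply a rough unit cost for essay scoring. The short Python sketch below works through the arithmetic using only the numbers quoted (about 85,000 papers, two readings each, roughly $500,000 in total). The variable names are illustrative, and the results are only as precise as those approximate figures.

```python
# Figures quoted in the passage (all approximate).
PAPERS = 85_000            # students writing the December essay
READINGS_PER_PAPER = 2     # each paper is scored twice
TOTAL_COST = 500_000       # dollars, "in the neighborhood of"

total_readings = PAPERS * READINGS_PER_PAPER
cost_per_paper = TOTAL_COST / PAPERS
cost_per_reading = TOTAL_COST / total_readings

print(f"total readings:   {total_readings:,}")
print(f"cost per paper:   ${cost_per_paper:.2f}")
print(f"cost per reading: ${cost_per_reading:.2f}")
```

On these figures, scoring comes to a little under $6 per paper, or about $3 per reading.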

Of 38 States that currently assess student writing 
skills, 36 use direct writing samples in which 
students are given one or more "prompts" or
questions requiring them to write in various formats. 
An additional nine States have plans to add a direct 
writing assessment. Many districts also use writing 
assessments (see figure 7-1). These tests are used for 
a variety of purposes: some are required to certify 
students for graduation or to identify students who 
need further instruction, while others are used for 
district accountability measures. 

For example, in order to identify students who
need extra help in writing instruction prior to
graduation, all ninth graders in the Milwaukee,
Wisconsin public schools write two pieces each
spring: a business letter and an essay describing a
solution to a problem in their own life. The
assessment helps reveal strengths and weaknesses in
writing instruction among the district's schools and
teachers. It is a standardized procedure, with all
students given the same set of instructions and a set
time limit for completing both pieces. Scoring is
done by the English teachers during a week in June.
The training process and the discussions that follow
the scoring are valued by the teachers as an
important professional activity, guiding them to
reflect on educational goals, standards, and the
evaluation of writing. The central office staff finds
this one of the best forms of staff development; by
clarifying the standards and building a consensus
among teachers, the writing program can be more
cohesively delivered throughout the district.36

The testimony of practitioners like the Milwaukee
teachers supports the positive effects of tests using
writing samples on writing instruction. It also
appears that the positive effects of direct writing
assessments on instruction are enhanced when
teachers do the scoring themselves. In 19 of the 36
States currently assessing writing with direct writing
samples, teachers from the home State score the
assessments.37

A recent survey of the teachers involved in the
California Assessment Program's (CAP) direct
assessment of student writing found that, as a result of
the direct writing assessment, over 90 percent of
them made changes in their own teaching: either
the amount of writing assigned, variety of writing
assigned, or other changes.38 Most report that they
believe the CAP writing assessment will increase
teachers' expectations for students' writing
achievement at their school and that the new assessment will
strengthen their school's English curriculum.
Finally, there was almost unanimous agreement with
the position that: ". . . this test is a big improvement
over multiple choice tests that really don't measure
writing skills."39 (See also box 7-C.)

An informal survey of practitioners using direct
writing samples found these effects: increased
quality and quantity of classroom writing instruction,
changed attitudes of administrators, increased
in-service training focused on teaching writing, use of
test results to help less able pupils get "real help,"
and improvement in workload for English
teachers.40 However, some practitioners noted possible
negative effects as well, including the increased
pressure on good writing programs to narrow their
focus to the test, tendencies of some teachers to teach
formulas for passing, and fears that the study of
literature may be neglected due to intense focus on
composition.

Because essays and direct writing assessments
have been used in large-scale testing programs, they
provide a rich source of information and experience
for new attempts at performance assessment. Many
practical issues, such as scoring and cost, are often
raised as barriers to the large-scale implementation
of performance assessment. The lessons drawn from
the history of essays and direct writing assessments
are illustrative, both for their demonstrations of
feasibility and promise and for their illumination
of issues that will require further attention and care.
These issues are discussed further at the end of this
chapter.

35 Gertrude Conlan, "'Objective' Measures ...
and Richard A. Donovan (eds.) (New York, NY: Longman, 1986), pp. 110-111, emphasis added.

36 Doug A. Archbald and Fred M. Newmann, Beyond Standardized Testing (Reston, VA: National Association of Secondary School Principals, 1988).

37 The 19 States in which teachers participate as scorers are: Arkansas (voluntary), California, Connecticut, Georgia, Hawaii, Idaho, Indiana
(voluntary), Maine, Maryland, Massachusetts, Minnesota, Missouri, Nevada, New York, Oregon, Pennsylvania, Rhode Island, Utah (voluntary), and
West Virginia. In two-thirds of these States, teachers are trained by State assessment personnel. In the other one-third, they are trained by the contractor.

38 California Assessment Program, "Impact of the CAP Writing Assessment on Instruction and Curriculum: A Preliminary Summary of Results of
a Statewide Study by the National Center for the Study of Writing," draft report, n.d. The study sampled 600 teachers at California's 1,500 junior or
middle schools in May 1988, just after the second statewide administration of the California Assessment Program's grade eight writing test.

39 Ibid.

40 ... "Objective Tests and Writing Samples: How Do They Affect Instruction in Composition?" Phi Delta Kappan, vol. 66, No. 9, May ...

Interviews and Direct Observations 

Oral examinations were the earliest form of
performance assessment. The example best known
among scholars is the oral defense of the dissertation
at the Master's and Ph.D. levels. There are many
varieties and uses of oral examinations at all school
levels. University entrance examinations in a few
countries are still conducted through oral
examinations. Foreign language examinations often contain
a portion assessing oral fluency. Other related
methods allow teachers or other evaluators to 
observe children performing desired tasks, such as 
reading aloud. 

The systematic evaluation of speaking skills has
been incorporated into the College Outcome
Measures Program (COMP) for the American College
Testing Program (ACT). This test was designed to
help postsecondary institutions assess general
education outcomes. For the speaking skills portion of
the assessment, students are given three topics and
told to prepare a 3-minute speech on each. At an
appointed time they report to a test site where they
tape record each speech, using only a note card as a
speaking aid. At some later time, trained judges
listen to the tapes and score each speech on attributes
related to both content and delivery.

Methods that use interviews and direct
observations are particularly appropriate for use with young
children. Young children have not yet mastered the
symbolic skills involved in communicating through
reading and writing; thus most
paper-and-pencil-type tests are inappropriate because they cannot
accurately represent what young children have
learned. The best window into learning for the very
young may come from observing them directly,
listening to them talk, asking them to perform tasks
they have been taught, and collecting samples of
their work. This approach uses adults' observations
to record and evaluate children's progress in
language acquisition, emphasizing growth over time
rather than single-point testing.

[Photo caption: Paper-and-pencil tests are often inappropriate for young
children. This teacher, in South Brunswick, New Jersey,
keeps a portfolio of her observations as she records
each child's developing literacy skills.]

Several States (i.e., Georgia, North Carolina, and
Missouri) have developed statewide
early-childhood assessments designed to complement
developmentally appropriate instruction for young
children. Most of these developmentally appropriate
assessments are based on an English model, the
Primary Language Record (PLR) developed at the
Center for Language in Primary Education in
London. The PLR is a systematic method of
organizing the observations teachers routinely make.
It consists of two parts, a continuous working record
and a summary form, completed several times a
year. The working record includes observations of
the child's literacy behavior, such as "running
records" of reading aloud, and writing samples, as
well as a list of books the child can read either in
English or the language spoken at home. The
summary record includes an interview with the
parents about what the child likes to read and do at
home and an interview with the child about his or her
interests. The interviews take place at the beginning
and end of each school year. The summary record
goes with the child to the next grade, throughout
primary school. The South Brunswick (New Jersey)
schools have recently incorporated this approach
into a teacher portfolio for assessing each student's
learning in kindergarten through second grade (see
box 7-D).

One assessment technique, used in South Bruns- 
wick as well as many other schools, is known as 
"reading miscue analysis." The teacher sits with an
individual student, listens to him read aloud, and 
systematically records the errors he makes while 
reading. From this analysis, which requires training, 
teachers can determine what strategies each child 
uses while reading. This can be a very useful 
assessment technique for all children, and especially 
in programs focused on improving reading skills in 
disadvantaged children. 
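
The record-keeping step of a miscue analysis can be sketched in code. The Python example below aligns the printed text with a transcript of what the child actually said and labels each mismatch as a substitution, omission, or insertion. It is a simplified illustration only: the function name `record_miscues`, the three category labels, and the use of word-level sequence alignment are assumptions, and the sketch does not capture what a trained teacher also notes, such as self-corrections or whether an error preserves meaning.

```python
from difflib import SequenceMatcher


def record_miscues(text: str, reading: str):
    """Compare the printed text with a transcript of what the child read.

    Returns a list of (kind, expected, said) tuples, where kind is
    "substitution", "omission", or "insertion". Hypothetical
    simplification of the teacher's written record.
    """
    expected_words = text.lower().split()
    said_words = reading.lower().split()
    kinds = {"replace": "substitution", "delete": "omission", "insert": "insertion"}
    miscues = []
    # Align the two word sequences; autojunk is disabled so that frequent
    # words such as "the" are never ignored during matching.
    matcher = SequenceMatcher(None, expected_words, said_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        miscues.append(
            (kinds[tag], " ".join(expected_words[i1:i2]), " ".join(said_words[j1:j2]))
        )
    return miscues


if __name__ == "__main__":
    for miscue in record_miscues("The cat sat on the mat", "the cat sit on mat"):
        print(miscue)
```

A tally of these records across several passages is the kind of raw material from which a teacher might infer which strategies a child relies on while reading.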

The Georgia Department of Education has
recently developed a new kindergarten assessment
program (see box 7-E). One important component of
this assessment is repeated and systematic
observations of each child by the kindergarten teacher in
many skill areas throughout the year. In addition,
each kindergarten teacher receives a kit containing
a number of structured activities that resemble
classroom tasks. A teacher spends individual time
with each student conducting these activities, which
assess the child's skills in a number of areas. For
example, one of the identified skills in the
logical-mathematical area is the child's ability to recognize
and extend patterns. The teacher presents the child
with a task consisting of small cut-out dinosaurs in
a variety of colors. Following a standardized set of
instructions, the teacher places the dinosaurs in a
sequenced pattern and asks the child to add to the
sequence. Several different patterns are presented so
that the teacher can assess whether the child has
mastered this skill. If the child does not successfully
complete the task, the teacher will know to work on
related skills in the classroom; later in the year the
teacher can use another task in the kit, this time using
cut-out tracks or flowers, to reassess the child's skill
in understanding patterns. Through this process, in
which the teacher works directly with the child in a
structured situation, the teacher is able to obtain
valuable diagnostic information to adjust instruction
for the individual child.

Exhibitions 

Exhibitions are designed as inclusive,
comprehensive means for students to demonstrate
competence. They often involve production of
comprehensive products, presentations, or performances before
the public. They usually require a broad range of
competencies and student initiative in design and
implementation. The term has become popularized
as a central assessment feature in the Coalition of
Essential Schools (CES), a loose confederation of
over 100 schools (generally middle and high
schools) that share a set of principles reflecting a
philosophy of learning and school reform that
emphasizes student-centered learning and rigorous
performance standards.

The term exhibition has two meanings as used in
the Essential Schools. The most specific is the
"senior exhibition," a comprehensive
interdisciplinary activity each senior must complete in order to
receive a diploma. In this regard senior exhibitions
are similar to the "Rite of Passage Experience"
initiated by the Walden III Senior High School in
Racine, Wisconsin. In order to graduate from
Walden III, all seniors must demonstrate mastery in
15 areas of knowledge and competence by
completing a portfolio, project, and 15 presentations
before a committee consisting of staff members, a
student, and an adult from the community.41

The CES senior exhibitions mirror some of these 
requirements, and typically fall into two main 
categories: the recital mode, which is a public 
performance or series of performances; and the 
"comprehensive portfolio" or "exhibition portfo- 
lio," a detailed series of activities, projects, or 
demonstrations over the school year that are cumula- 
tively assembled and provide an aggregate picture of 
a student's grasp of the central skills and knowledge 
of the school's program. 

There is also a general use of the term
"exhibition" to mean a more discrete performance
assessment when the student must demonstrate that he or
she understands a rich core of subject matter and can
apply this knowledge in a resourceful, persuasive,



^ *iArchbaId and Newmano, op. cit, footnote 36, p. 23. 
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Box 7.D— South Branswick %acher Portfolios: Records of Progress in Early Childhood Learning^ 

H<m do you know if youQg childftt aie devdoping critknd la^^ 
if you do not give diem teiu? lliis is die pfedicament facing many sdwols as educaton become increaiingiy 
disenchanted with giving standaiized paper-and-pencil tests to young childrat When the Soudi Branswick, New 
J eiscy schools adopted a new, moie devek)pmntal!y appropriate curriculum it became necessaiy to devek^ a new 
method of assessmett consistent with dds teaching approach. Uachers worioed i«ilh ^ 
teacher poitfoUo diat diew on several nK>dels, inchiding the Primaiy iMmi^ 

TbacheiB pik)ted the portfolios over the 1989-90 school year, and revised them in the summer of 1990 for use die 
following school year. 

The purpose of the p(xtfolio is to focus on language acquisition k young studentt, grades K dirough 2. 
Ibadiers view die portfolio as a tod to pmnote instiucdon. It gives them a pKtuie of die learning strat^es of each 
child, which can be die basis of developing activities diat will stress students' strengths while providing practice 
and help widi weaknesses. 

Each portfolio consisU of 10 partt, phis one optional part: 

• Selfportrdt—TbechiUisaskedto ''drawapictureof yourself' atdiebeginninganddieend of die 
year. The poitndts cm gaieiaify placed on die front and back covers of a maniU fokler. 

• IhterWew^This may be conducted several thnes during die year and hicludes die child's answers 
questions as: What is your f^oiite diing to do at home? IX> you watch TV? Sesams Street? Do you have 
books at home? What is your finrorite book? 1>) odier people at home like to read? Wliat do diey 1^ 
someone read to you at home? 

• Parent questionnaire— Parents craqilete diis befim dieir first conference widi die teacher. It hidudes 
questions draut the diild' s reading hiterests as well as any concerns die parent has about die child' s language 
or reading developtnent 

• Omceirts about print test~T1iis check list measures die child's understanding of sign^ 
printed bnguage, such as die fiont of die book, diat prim (not diepk^) tells die story, what a lett^ 

a woid is, where a word b«ghis and ends, and big and littfe letters. This is a nationaUy normed test an^ 
also used to identify children in need of comprasatoiy edu(»tion. 

• Word awareness writing activity—This records the level at which children begin to ... 
of forming words in their writing. Progress is recorded along a five-stage scale: precommunicative (random 
spelling or scribbling); semiphonetic (some sounds represented by letters, e.g., the word "feet" might be 
rendered as "ft"); phonetic (letters used appropriately for sounds, e.g., "fet"); transitional (some awareness 
of spelling patterns, e.g., "fete"); or mostly correct (10 out of 13 words correctly spelled). 

• Reading sample—This is taken three or more times a year; the teacher may use a "running record" or 
"miscue analysis." The running record is used with emergent readers, children who mimic the act of reading 
but do not yet know how to read. It records what a young child is thinking and doing while "reading." 

• Writing sample—This is a sample of the student's free writing, "translated" by the student for the teacher 
if invented spelling or syntax make it difficult to read easily. 

• Student observation forms (optional). 

• Story retelling form. 

• Diagnostic form. 

• Class record—This class profile helps the teacher identify those children who may need extra attention in 
certain areas. It is a one-page matrix with yes/no answers to the following five questions: Does the child pay 
attention in large and small groups? Interact in groups? Retell a story? Choose to read? Write ... 
is the only element of the portfolio not a part of the child's individual record. 

Because of Federal requirements for determining eligibility for compensatory education, the South Brunswick 
schools also use norm-referenced, multiple-choice tests. However, teachers report that these tests are not useful 
because they do not assess development in the instructional approach adopted by the South ... 



1 Much of the material in this box comes from Ruth Mitchell and Amy Stempel, Council for Basic Education, "Six Case Studies of 
Performance Assessment," OTA contractor report, March 1991. 
The tests go from part to whole, and our programs go from whole to part. Those tests are basically for ... [... 
reading textbooks], and to ... kids that have learned a whole language by basals [when the South Brunswick 
students used children's literature as texts]—it makes no ... 

The portfolios provide a different approach to ... 
held in grade before because of low test scores, research has suggested that having a child repeat a year in grade 
may in fact cause more harm than good. In South Brunswick, when ... 
education labeling in the early grades, the portfolio record is consulted to see if the child has made progress. If 
progress can be shown, then the student is promoted on the assumption that every child develops at his or her own 
rate and can be monitored closely until he or she reaches the third grade. If no ... 
the child is promoted but is identified for compensatory education. 

One of the purposes of the portfolio is to help the teacher provide a clearer picture of student progress to ... 
than is possible from standardized test scores. Yet a tension remains ... 
are derived from norm-referenced, multiple-choice tests are familiar and understandable. The new developmentally 
appropriate methods of teaching and testing do not have the perceived rigor or precision of the old tests. Some 
parents assume that only norm-referenced tests can be objective, and ... 
the portfolios. Some want traditional test scores that assure that their children are learning what everyone else 
in the country is learning—or can be measured against children in other communities. Until this tension is resolved, 
full acceptance of a portfolio system may be slow. As one teacher said: 

The next step is to educate the parents. We need workshops for parents. That is the big issue, after we get all the 
teachers settled in using the portfolio. This is basically not going to be acceptable ... 
everyone can see that we're graduating literate kids and that's not going to be until many, many years from now. 

Standardization of the portfolio assessment was not an issue for the teachers, because ... 
instructional information tool. Since the teachers were involved in the initial design and remain involved in 
modifications, and as they have attended workshops on its use, there is implicit ... 
Brunswick portfolio is primarily meant as a feedback mechanism to improve instruction ... 
an accountability instrument. The Educational Testing Service (ETS), working with the teachers, has produced a 
numerical literacy scale based on the portfolio. The scale provides a means of aggregating ... 
Central office staff, working with a consultant from ETS, examined ... 
as evidenced by the portfolio on these scales. Teachers in one school rated the portfolios ... 
order to evaluate how well the system communicates standards. The "South Brunswick-Educational Testing 
Service scale" for evaluating children's progress in literacy is now being used in ... 
scales replace the first grade standardized reading test. The existence of aggregatable data ... 
scoring and the overall value of the portfolio in the South Brunswick public schools. 

There are additional approaches to standardizing the portfolio. Some of the contents, ... 
Print test and the Word Awareness Writing Activity, can be scored using a key. ... 
can also be scored consistently. Those aspects that cannot be scored using a key, e.g., the ... 
graded by a group of teachers developing a rubric from each set of papers. This could ... 
exchanging a sample of portfolios among teachers, so that each reads about 10 percent from each ... 
common standards. This is the method used by the New York State Department of Education to ensure 
standardization of the results of their grade four science manipulative skills test. It is also used in several European 
school systems. 

The issue of ... has not been raised, since the teachers record each student's growth against himself or herself, 
not in comparison with other students in the class or school. However, this issue will be more prominent if 
achievement levels are set and there are differing success rates in meeting these standards, or if the portfolio ... 
for school accountability or for student selection, two goals not currently planned. 



2 Willa Spicer, director of instruction, South Brunswick ... 
3 ... Shepard and Mary Lee Smith, Flunking Grades ... 
4 Mitchell and Stempel, op. cit., footnote 1, p. 17. 
Box—Testing in Early Childhood: Georgia's Kindergarten Assessment Program 

In recent years, many educators and policymakers have been reducing or eliminating the use of standardized 
paper-and-pencil tests in the early grades. Many of these tests were ... 
retention and whether children were ready to begin first grade. The issue of retention in the early grades, as well 
as the role of tests in making such decisions, is receiving ... 
Texas State Board of Education recently barred the retention of any pupils in prekindergarten and kindergarten. 
The legislatures in both Mississippi and North Carolina have eliminated State-mandated testing in the early grades. 
At least two States, Kentucky and Florida, are encouraging ungraded primaries (K-3) which loosen the rigid 
boundaries between the early grades and allow children to move according to individual progress. 

In a policy running somewhat counter to these trends, the Georgia Legislature in 1985 mandated that all 
6-year-olds must pass a test in order to enter first grade. During the first 2 years of this policy, a standardized 
paper-and-pencil test was used. However, the use of such a test quickly brought to public attention concerns about 
this approach to readiness assessment, including: 

1. the appropriateness of a paper-and-pencil test for children who are five to six years of age. 

2. the concern that a focus on tests narrows the curriculum . . . 

3. the need to consider not just the child's cognitive skills, but the development of social, emotional, and physical 
capacities as well. 

4. the need to consider the teacher's observations of the child throughout the course of the school year. 

In response to these concerns the Georgia Department of Education embarked on a large project to design a 
developmentally appropriate model of assessment. The Georgia Kindergarten Assessment Program (GKAP), 
piloted during 1989-90, uses two methods of assessment—observations by kindergarten teachers and individually 
administered standardized tasks that resemble classroom activities. GKAP assesses a child's capabilities in five 
areas: communicative, logical-mathematical, physical, personal, and social. This assessment program is designed 
to help teachers make multiple, repeated, and systematic observations about each child's progress during the year. 
Behavioral observations in all five areas are made in three time periods throughout the year. In addition, a set of 
structured activities has been designed to assess each child's communicative and logical-mathematical 
capabilities. The teacher conducts each of these activities individually with a child. If a child cannot successfully 
complete the task, teachers can plan activities to help the child ... 
assessing that same skill can be given by the teacher later in the year. These tasks involve toys, manipulatives, and 
colorful pictures. 

Each kindergarten teacher in Georgia receives a GKAP kit that contains manuals for administration, 
manipulatives, and reporting forms. Training and practice are required prior to the use of GKAP. A self-contained 
video training program developed for this purpose has been provided to each school. 

The education department anticipates that this assessment program will serve a number of important functions: 

A significant use of GKAP results is to provide instructionally relevant diagnostic information for kindergarten 
teachers. In the process of collecting GKAP information, teachers gain insight regarding their students' 
developmental status and subsequent modifications which may be needed in their ... 
when forwarded, this information will also be useful to the child's teacher at the beginning of the first grade year. 
Another use of GKAP results is communication with parents about their child's progress throughout the kindergarten 
year. 

The results of the GKAP are also to serve, along with other information about the child ... 
regarding whether to promote the child to the first grade. GKAP results, by themselves ... 
criterion for promotion/retention (placement) decisions. 



1 Deborah L. Cohen, "... Board Votes to Forbid Retention Before the 1st Grade," Education Week, vol. 90, No. 1, Aug. 1, 1990. 
2 Mississippi stopped testing kindergarten ... 
"Kindergarten: Producing Early Failure?" Principal, vol. 69, May 1990, pp. ... 

3 Werner Rogers and Joy E. Blount, "Georgia's First Grade Readiness Assessment: The Historical Perspective," paper presented at the 
annual meeting of the American Educational Research Association ... p. 3. 

4 Susan P. Tyson and Joy E. Blount, "The Georgia Kindergarten Assessment Program: A State's Emphasis on a Developmentally 
Appropriate Assessment," paper presented at the American ... 
and imaginative way. It is a creative and difficult 
concept to put into place, however, and requires that 
the teacher create assignments that take students 
beyond the surface of a subject. For example, one 
history teacher suggested: "Under the old system, 
the question would be 'Who was the King of France 
in 800?' Today, it is 'How is Charlemagne important 
to your life?'" While the exhibition format could 
be an essay or research paper, it might also call for 
a Socratic dialog between student and teacher, an 
oral interview, debate, group project, dramatic 
presentation, or combination of multiple elements, 
partly in preparation for the more comprehensive 
senior exhibitions. Clearly, developing and evaluat- 
ing successful exhibitions can be as big a challenge 
to the teachers as it can be for the students to perform 
well on them. 

Exhibitions can also be competitions, some at the 
individual level, like the Westinghouse Science 
Talent Search, or in groups, like the Odyssey of the 
Mind, a national competition requiring groups of 
students to solve problems crossing academic disci- 
plines. Group competitions add group cooperation 
skills to the mix of desirable outcomes. 

One interesting group competition is the Center 
for Civic Education's "We the People . . ." program 
on Congress and the Constitution. It is a national 
program, sponsored by the Commission on the 
Bicentennial of the U.S. Constitution and funded by 
Congress. Students in participating schools study a 
specially developed curriculum and compete with 
teams from around the country. In the competition 
they serve as panels of "experts" testifying before 
a mock congressional committee. The curriculum 
can be used as a supplement to American history or 
civics classes and has materials that are appropriate 
for three levels (upper elementary, middle school, 
and high school). The text centers on the history and 
principles of the U.S. Constitution. When students 
have completed the curriculum the entire class is 
divided into groups, each responsible for one unit of 
the curriculum. Each group presents statements and 
answers questions on its unit before a panel of 
community representatives who act as the mock 
congressional committee members. Winning teams 



from each school compete at district, State, and 
finally national-level competitions. Training for 
judges at each level is conducted through videotapes 
and training sessions in which the judges evaluate 
each group on a scale of 1 to 10, on the criteria shown 
in figure 7-6. 
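
The scoring sheet in figure 7-6 can be read as simple arithmetic: six criteria, each scored from 1 to 10, are summed into a group total, and a separate tie-breaker of up to 100 points is consulted only when totals are equal. The Python sketch below is an illustration only, not the Center for Civic Education's procedure. The function names, the data layout, and the assumption of one sheet per team are hypothetical; the report does not say how sheets from several judges are combined.

```python
# Illustrative sketch of the "We the People" scoring sheet (figure 7-6).
# Names and data layout are hypothetical; only the criteria, the 1-10 scale,
# the band labels, and the tie-breaker rule come from the sheet itself.

CRITERIA = (
    "Understanding",
    "Constitutional Application",
    "Reasoning",
    "Supporting Evidence",
    "Responsiveness",
    "Participation",
)

# Descriptive bands printed on the sheet: (low, high, label).
BANDS = (
    (1, 2, "poor"),
    (3, 4, "fair"),
    (5, 6, "average"),
    (7, 8, "above average"),
    (9, 10, "excellent"),
)


def band(score):
    """Return the descriptive label for a single 1-10 criterion score."""
    for low, high, label in BANDS:
        if low <= score <= high:
            return label
    raise ValueError(f"score must be between 1 and 10, got {score}")


def group_total(scores):
    """Sum the six criterion scores after checking that each one is valid."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for criterion in CRITERIA:
        band(scores[criterion])  # raises if the score is out of range
    return sum(scores[criterion] for criterion in CRITERIA)


def rank(sheets):
    """Order (team, scores, tie_breaker) tuples from best to worst.

    The tie-breaker (up to 100 points) only affects the order of teams
    whose group totals are equal, as the sheet specifies.
    """
    return sorted(sheets, key=lambda s: (group_total(s[1]), s[2]), reverse=True)
```

Sorting on the pair (group total, tie-breaker) keeps the tie-breaker from ever outweighing the criterion scores, which matches the sheet's note that bonus points are used only in the event of a tie.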

Experiments 

Science educators who suggest that students can 
best understand science by doing science have 
promoted hands-on science all across the science 
curriculum. Similarly, they maintain that students' 
understanding of science can best be measured by 
how they do science — the process of planning, 
conducting, and writing up experiments. Thus, 
science educators are seeking ways to assess and 
measure hands-on science. A number of States, 
including New York, California, and Connecticut, 
have pioneering efforts under way to conduct 
large-scale hands-on assessments in science. 

In 1986, NAEP conducted a pilot project to 
examine the feasibility of conducting innovative 
hands-on assessments in mathematics and science. 
Working closely with the staff of Great Britain's 
Assessment of Performance Unit, 30 pilot tasks 
using group activities, work station activities, and 
complete experiments were field tested. School 
administrators, teachers, and students were enthusi- 
astic and encouraging about these efforts. As part of 
the pilot project, NAEP has made available detailed 
descriptions of these 30 tasks so that other educators 
can adapt the ideas. A sample experiment used 
with third graders and scoring criteria are pictured in 
figure 7-7. 

New York Elementary Science Program 
Evaluation Test 

In 1989, the New York State Department of 
Education, building on the NAEP tasks, included 
five hands-on manipulative skills tasks as an impor- 
tant component of their Elementary Science Pro- 
gram Evaluation Test (ESPET). Used with fourth 
graders, the test also included a content-oriented, 
paper-and-pencil component. It was the intent of the 



James Charleson, Hope High School, Providence, RI, quoted in Thomas Toch and Matthew Cooper, "Lessons From the Trenches," U.S. News & 
World Report, vol. 108, No. 8, Feb. 26, 1990, p. 54. 

See Educational Testing Service, Learning by Doing: A Manual for Teaching and Assessing Higher Order Thinking in Science and Mathematics 
(Princeton, NJ: May 1987); or the full report, Fran Blumberg, Marion Epstein, Walter MacDonald, and Ina Mullis, A Pilot Study of Higher Order 
Thinking Skills: Assessment Techniques in Science and Mathematics, Final Report (Princeton, NJ: Educational Testing Service, November 1986). 
Figure 7-6—Scoring Sheet for the "We the People" Competition 

Student teams act as witnesses before a "Congressional Committee" and 
answer questions on the U.S. Constitution (history, law, and current applications). 
Each group is scored on a scale of 1-10 on the criteria listed below. 

1-2 = poor    3-4 = fair    5-6 = average    7-8 = above average    9-10 = excellent 

                                                              Score    Notes 

1. Understanding: To what extent did participants 
   demonstrate a clear understanding of the basic 
   issues involved in the questions? 

2. Constitutional Application: To what extent did 
   participants appropriately apply knowledge of 
   constitutional history and principles? 

3. Reasoning: To what extent did participants 
   support positions with sound reasoning? 

4. Supporting Evidence: To what extent did participants 
   support positions with historical or contemporary 
   evidence, examples, and/or illustrations? 

5. Responsiveness: To what extent did participants' 
   answers address the questions asked? 

6. Participation: To what extent did most group 
   members contribute to the group's presentation? 

Group total: 

Judge:                        Date: 

Congressional District:       Tie breaker*: 

*Please award up to 100 points for this group's overall performance. 
(Bonus points will only be used in the event of a tie.) 

SOURCE: Center for Civic Education, Calabasas, CA. 

test designers to align classroom practices with the 
State objectives reflected in the syllabus. 

The manipulative test consists of five tasks, and 
each student is given 7 minutes to work on each of 
the tasks. At the end of each timed segment, the 
teacher organizes a swift exchange of desks, or 
stations, moving the front row children to the back 
of the column and the others each moving up one 
desk, somewhat like a volleyball rotation. Test 
stations are separated by cardboard dividers and are 
arranged so that adjacent stations do not have the 
same apparatus. Four classes of about 25 children 
each can be tested comfortably in a school day. The 
skills assessed by the five stations include measure- 



ment (of volume, length, mass, and temperature), 
prediction from observations, classification, hypoth- 
esis formation, and observation. 
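
As an illustration of the rotation described above, the following Python sketch models one column of desks. It assumes, hypothetically, that a column holds exactly five desks, one per task, so that five 7-minute segments take every child through every station. The report does not give the actual room layout, and the names used here are invented for the example.

```python
# Illustrative sketch of the ESPET desk rotation; not the State's procedure.
# Assumption: each column has exactly TASKS desks, one per manipulative task.
from collections import deque

TASKS = 5              # manipulative skills tasks (from the text)
MINUTES_PER_TASK = 7   # timed segment per task (from the text)


def rotation_schedule(students):
    """Return one seating list per timed segment for a single column.

    Index 0 is the front desk. After each segment the front child moves to
    the back of the column and everyone else moves up one desk.
    """
    if len(students) != TASKS:
        raise ValueError(f"this sketch assumes exactly {TASKS} desks per column")
    column = deque(students)
    rounds = []
    for _ in range(TASKS):
        rounds.append(list(column))
        column.rotate(-1)  # front child goes to the back; others move up one
    return rounds


def working_minutes():
    """Total timed working minutes per child, excluding changeover time."""
    return TASKS * MINUTES_PER_TASK


if __name__ == "__main__":
    for segment, seating in enumerate(rotation_schedule(list("ABCDE")), start=1):
        print(f"segment {segment}: {seating}")
    print("timed minutes per child:", working_minutes())
```

Under this assumption each child sits at every desk exactly once, and the timed portion of the test takes 35 minutes per class, which is consistent with the report's statement that four classes can be tested in a school day.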

The examinations were scored by the students' teachers, 
but student scores were not reported above the 
school level. School scores were reported in terms of 
the items on which students had difficulty. The 
ESPET is currently being evaluated for use in other 
grades. 

Connecticut Common Core Science and 
Mathematics Assessments 

Connecticut has been a leader in the development 
of a set of mathematics and science assessments that 



Sally Bauer, Sandra Mathison, Eileen Merriam, and Kathleen Toms, "Controlling Curricular Change Through State-Mandated Testing: Teacher's 
Views and Perceptions," paper presented at the annual meeting of the American Educational Research Association, Boston, MA, April 1990, p. 7. 



chapter 7— Performance Assessment: Methods and Characteristics • 225 



call on group skills and performance activities. 
Under a 45-month grant from the National Science 
Foundation, Connecticut has assembled teams of 
high school science and mathematics teachers work- 
ing jointly on Connecticut Multi-State Performance 
Assessment Collaborative Teams (CoMPACT). CoM- 
PACT is made up of seven State Departments of 
Education (Connecticut, Michigan, Minnesota, New 
York, Texas, Vermont, and Wisconsin), CES, The 
Urban District Leadership Consortium of the Ameri- 
can Federation of Teachers, and Project Re:Learn- 
ing. 

The CoMPACT group has designed and devel- 
oped 50 performance assessment tasks, 31 across 8 
areas of high school science (biology, chemistry, 
Earth science, and physics) and 19 in mathematics 
(general or applied mathematics, algebra, geometry, 
and advanced mathematics). After pulling together 
the experiences of CoMPACT teachers trying out 
these tasks, Connecticut will convene committees of 
expert judges to establish "marker papers" and 
common scoring standards. These scoring standards 
will be used during 1991-92 on the first administra- 
tion of the Connecticut Common Core of Learning 
Assessments in high school science and mathemat- 
ics across the State. A key element of the entire 
endeavor will be the assessment of student attitudes 
toward science and mathematics, and the demonstra- 
tion of teamwork and interpersonal skills in these 
real-life testing contexts. 

Each task has three parts that require individual 
work at the beginning and end, and group work in the 
middle (see figure 7-8). First, each student is 
presented with the task and asked to formulate a 
hunch, an estimate of the solution, and a preliminary 
design for a study. This portion of the task has 
several goals—it focuses the student's preliminary 
thinking, becomes a springboard for student group 
discussion, gives the teacher a feel for where the 
students are in their thinking, and serves as a record 
that the student can revisit throughout the assess- 
ment. 

The middle section involves the longest phase. 
Here students plan and work together to produce a 



group product; teamwork is emphasized throughout. 
Evidence of deepening understanding is recorded 
through a variety of assessment tools such as written 
checklists, journals, logs, or portfolios. Oral or 
visual records such as videotapes of group discus- 
sions and oral presentations are also maintained. 
Teachers can rate individual performance on a 
subset of objectives in the group task. The ability to 
infer levels of individual contribution on collective 
work is one of the largest assessment challenges. 

The third part of the task consists of individual 
performance on a related task. These tasks consist of 
similar activities that attempt to assess some of the 
same content and processes as the group task. The 
transfer task provides each student with an opportu- 
nity to synthesize and integrate the learning that 
occurred in the group experience and apply it in a 
new context. It also provides teachers, parents, and 
policymakers with a summative view of what each 
student knows and can do at the end of a rich set of 
learning and assessment opportunities. 

Several evaluations of the project have been 
completed to date. Teacher perceptions are quite 
positive. Through the participation of the Urban 
District Leadership Consortium, students in 16 
large urban school systems tried out the performance 
tasks during the 1990-91 school year, demonstrating 
the feasibility of this type of assessment in schools 
with large populations of African-American and 
Hispanic students. 

Portfolios 

Portfolios are typically files or folders that contain 
a variety of information documenting a student's 
experiences and accomplishments. They furnish a 
broad portrait of individual performance, collected 
over time. The components can vary and can offer 
multiple indicators of growth as well as cumulative 
achievement. As students assemble their own port- 
folios, they evaluate their own work, a key feature in 
performance assessment. Proponents suggest that 
this process also provides students a different 
understanding of testing, with the following positive 
effects: 



See Pascal D. Forgione, Jr. and Joan Boykoff Baron, Connecticut State Department of Education, "Assessment of Student Performance in High 
School Science and Mathematics: The Connecticut Study," paper presented at the Seminar on Student Assessment and Its Impact on School Curriculum, 
Washington, DC, May 23, 1990. 

Joan Boykoff Baron, Connecticut Department of Education, personal communication, November 1991. 



Figure 7-7—"Sugar Cubes": A NAEP Hands-On Science Experiment for 3rd Graders 

The Experiment 

Students are given laboratory equipment and asked to 
determine which type of sugar, granulated or cubed, 
dissolves faster when placed in warm water that is 
stirred and not stirred, respectively. To complete this 
investigation, students need to identify the variables 
to be manipulated, controlled, and measured. They also 
need to make reliable and accurate measurements, 
record their findings, and draw conclusions. Examples 
of written conclusions are presented on the next page. 

[Illustration panels: NOT STIRRING (set-up, measurement); RESPONSE SHEET; STIRRING (set-up, measurement)] 

The Observation 

NAME: 
CODE: 
SCHOOL DISTRICT: 

Sugar Cubes Behavioral Checklist 

1. Loose sugar tested 
2. Cube sugar tested 

3. Volume of water measured—by eye 
4. by ruler 
5. by cylinder 
6. Volume used < 10 cc 
7. Volume used > 10 cc 
8. Volume same for both types 
9. Mass same for both types 

10. No apparent measurement 
11. Qualitative measurement 
12. Clock used 
13. within ± 3 secs. of start point 
14. within ± 3 secs. of end point 
15. Timed—until all dissolved 
16. until partially dissolved 
17. no clear end point 
18. Fixed time—notes amount remaining 

*19. Reports results consistent with evidence 

20. Stirring not tested—sugar type not controlled 
21. Loose sugar tested 
22. Cube sugar tested 
23. Stirring tested—by counting number of stirs 
24. by timing 
25. Stirring at regular intervals 
26. Stirring rate—constant 
27. random 

28. Volume of water measured—by eye 
29. by ruler 
30. by cylinder 
31. Volume used < 10 cc 
32. Volume used > 10 cc 
33. Volume same for both types 
34. Mass same for both types 

35. No apparent measurement 
36. Qualitative measurement 
37. Clock used 
38. within ± 3 secs. of start point 
39. within ± 3 secs. of end point 
40. Timed—until all dissolved 
... dissolved 
Using detailed checklists, NAEP administrators recorded 
students' strategies for determining—with accurate and 
reliable measurements—whether loose sugar or sugar 
cubes dissolved at a faster rate. 

[Checklist items 41 through 47 are only partly legible: "... consistent with evidence," "... findings," "... minimal)"] 

*48. Acknowledges that procedures could be improved if 
experiment repeated—aware that certain variables ... 
FIND OUT IF STIRRING MAKES ANY DIFFERENCE IN HOW 
FAST THE SUGAR CUBES AND LOOSE SUGAR DISSOLVE. 

B) Use the space below to answer the question in the box. 

[Handwritten sample answers, with score received: 5 point answer; 3 point answer; 1 point answer] 


Scoring of Written Answers 

5 points = response states that both types of sugar dissolve faster but loose sugar dissolves the fastest. 
4 points = response states that the loose sugar dissolves faster than the cube and that stirring is the cause of it. 
3 points = response states that stirring makes a difference only or how or why an effect upon the sugar is found only. 
2 points = response states that one type of sugar dissolves faster than another only. 
1 point = incorrect response. 
0 points = no response. 



KEY: NAEP = National Assessment of Educational Progress. 

SOURCE: Educational Testing Service, Learning by Doing: A Manual for Teaching and Assessing Higher Order Thinking in Science and Mathematics (Princeton, NJ: May 1987); and Fran Blumberg, 
Marion Epstein, Walter MacDonald, and Ina Mullis, A Pilot Study of Higher Order Thinking Skills: Assessment Techniques in Science and Mathematics, Final Report (Princeton, NJ: 
Educational Testing Service, November 1986). 
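
The scoring guide in figure 7-7 is an ordinal rubric: a rater reads each written conclusion and assigns the single point value whose description fits best. A minimal Python sketch of how such scores might be recorded and tallied follows. The judgment itself remains the rater's; the dictionary wording is taken from the guide, while the function names and any sample scores are hypothetical.

```python
# Illustrative sketch only: records rubric scores assigned by a human rater.
# The point descriptions come from the figure 7-7 scoring guide; the helper
# names are hypothetical and not part of any NAEP software.
from collections import Counter

RUBRIC = {
    5: "both types of sugar dissolve faster but loose sugar dissolves the fastest",
    4: "loose sugar dissolves faster than the cube and stirring is the cause of it",
    3: "stirring makes a difference only, or how or why an effect upon the sugar is found only",
    2: "one type of sugar dissolves faster than another only",
    1: "incorrect response",
    0: "no response",
}


def describe(points):
    """Return the rubric description for a point value, rejecting invalid ones."""
    if points not in RUBRIC:
        raise ValueError(f"not a valid rubric score: {points!r}")
    return RUBRIC[points]


def distribution(scores):
    """Count how many responses received each point value."""
    for points in scores:
        describe(points)  # validate every score before counting
    return Counter(scores)
```

Keeping the full distribution, rather than only an average, preserves the information a teacher would need to see how many students stopped at a partial explanation.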
Figure 7-8—Connecticut Science Performance Assessment Task: "Exploring the Maplecopter" 

OVERVIEW: This task was designed for high school physics classes and includes both individual and group work. Students 
study the motion of maple seeds and design experiments to explain their spinning flight patterns. Curriculum topics include 
laws of motion, aerodynamics, air resistance, and the use of models in explaining scientific phenomena. Equipment needed: 
maple seeds, paperboard, stopwatches, and scissors. The suggested length of time for the task is 3 to 5 class periods. 

Part I: Getting Started by Yourself 

1. Throw a maple winged seed up in the air and watch it 
"float" down to the floor. Describe as many aspects of the 
motion of the pod as you can. You may add diagrams if you 
wish. 

2. One of the things you probably noticed is that the seed 
spins as it falls, like a little helicopter. Try to explain how and 
why the seed spins as it falls. 

Part II: Group Work 

The criteria that will be used to assess your work are found 
on the Objectives Rating Form - Group. Each member of 
your group will also fill out the Group Performance Rating 
Form. 

1. Discuss the motion of the winged maple seed with the 
members of your group. Write a description of the motion, 
using the observations of the entire group. You may add 
diagrams if you wish. 

2. Write down the variables that might affect the motion of 
the maple seed. 

3. Design a series of experiments to test the effect of each 
of these variables. Carry out as many experiments as 
necessary in order to come up with a complete explanation 
for the spinning motion of the winged seed. 

Using Models in Science 

4. Sometimes using a simplified model (or a simulation) 
might help one to understand more complex phenomena. 
A paper helicopter, in this case, might serve as a simplified 
model of the seed. 

a. Construct a paper helicopter following the general 
instructions in figures 1 and 2. 

b. Throw the paper helicopter in the air and observe its 
motion. 

c. Try changing various aspects of the paper helicopter 
to test the effect of the variables your group chose. 

d. Experiment with different types of paper helicopters 
until you feel that you have a complete understanding of 
how the variables you identified affect the motion. 

e. Summarize your results with the help of a chart or a 
graph. 

5. Based on what you've learned from the paper helicop- 
ters, design and perform additional experiments with the 
maple seeds. 

6. Describe your group's findings from all your experi- 
ments. Raw data should be presented in charts or graphs, 
as appropriate, and summarized by a short written state- 
ment. 

7. Now, after you have completed all the necessary 
experiments, try to explain again the motion of the maple 
seed. Try to include in your explanation the effect of all the 
variables that you observed in your experiments. You may 
add diagrams if you wish. 

8. In this activity you used simplified models to help explain 
a more complicated phenomenon. Describe the advan- 
tages and disadvantages of your paper helicopter as a 
model of a winged maple seed. 

9. What are the biological advantage(s) of the structure of 
the maple seed? Explain fully. 

Part III: Finishing by Yourself 

THE GRAND MAPLECOPTER COMPETITION 

Your goal is to design a helicopter, from a 4" X 8" piece of 
paperboard, that will remain in the air for the longest time 
when dropped from the same height. 

a. Design the "helicopter." 

b. Write down factors related to your design. 

c. Cut out the "helicopter." 

d. Mark the helicopter with your name. 

e. Good luck and have fun! 

[Illustrations: Figure 1; Figure 2] 

SOURCE: Connecticut State Department of Education, 1991. 
GROUP PERFORMANCE RATING FORM 

Student Name: 
Student ID: 

[Rating columns include "Often" and "Sometimes"; the other column headings and the right-hand end of several items are cut off in the source.] 

A. GROUP PARTICIPATION 
1. Participation in group discussion without prompting ... 
2. Did his or her fair share of the work 
3. Tried to dominate the group - interrupted others ... 
4. Participated in the group's activities 

B. STAYING ON THE TOPIC 
5. Paid attention, listened to what was being ... 
6. Made comments aimed at getting the group ... 
7. Got off the topic or changed the subject 
8. Stayed on the topic 

C. OFFERING USEFUL IDEAS 
9. Gave ideas and suggestions that helped ... 
10. Offered helpful criticism and comments ... 
11. Influenced the group's decisions and ... 
12. Offered useful ideas 

D. CONSIDERATION 
13. Made positive, encouraging remarks ... 
14. Gave recognition and credit to others ... 
15. Made inconsiderate or hostile remarks ... 
16. Was considerate of others 

E. INVOLVING OTHERS 
17. Got others involved by asking ... 
18. Tried to get the group working ... 
19. Seriously considered the ideas ... 
20. Involved others 

F. COMMUNICATING 
21. Spoke clearly, was easy to hear ... 
22. Expressed ideas clearly and effectively 
23. Communicated clearly 

A Sample of Other Science Performance Tasks Under Development 

BOILING POINT LABORATORY: Students are asked to design and carry out a controlled experiment to determine the mixture of 
antifreeze and water that has the highest boiling point and is thus the most effective in keeping cars running smoothly in extreme 
temperatures. 

OUTCROP ANALYSIS: Students are given a variety of information, including videotapes, pictures, and rock samples, from a site in 
Connecticut and are asked to determine if it is a good site on which to build a nuclear power facility. Students may be asked to 
investigate other factors, such as population, waste disposal, weather, politics, etc., in determining if it is a good site. 

WEATHER PREDICTION: Students are asked to predict the weather based on their knowledge of meteorology, data they collect, 
and observations that they are able to make. Students may be asked to make simple weather instruments or create a weather 
forecasting segment as it would appear on a television newscast. 




230 • Testing in American Schools: Asking the Right Questions 



• testing becomes a personal responsibility; 

• students realize that they need to demonstrate a 
full range of knowledge and accomplishments, 
rather than a one-shot performance; 

• they begin to learn that first draft work is never 
good enough; and 

• they appreciate that development is as impor- 
tant as achievement.⁴⁷ 

A small but growing number of States have 
embraced portfolios as an educational assessment 
tool. As of 1991, five States (Alaska, California, 
North Carolina, Rhode Island, and Vermont) had 
implemented portfolios as a mandatory, voluntary, 
or experimental component of the statewide educa- 
tional assessment program. Four additional States 
(Delaware, Georgia, South Carolina, and Texas) are 
considering implementing portfolios for this pur- 
pose. At the State level, portfolios have been 
implemented mostly in mathematics and writing at 
grade levels ranging from 1st to 12th but concen- 
trated in the early grades.⁴⁸ The Vermont experience 
with portfolios is noteworthy (see box 7-F). Michi- 
gan's portfolio project, begun on a pilot basis in 22 
districts during 1990-91, focuses on the skills that 
high school graduates are expected to have in order 
to be productive workers. As described in box 7-G, 
this use of portfolios aims at providing both students 
and prospective employers with information on 
workplace skill competencies. 

Research on effectiveness of portfolios is being 
assembled by the project Arts PROPEL, a 5-year 
cooperative effort involving artists, researchers from 
Harvard University's Project Zero, the Educational 
Testing Service (ETS), and teachers, students, and 
administrators from the Pittsburgh and Boston 
public school systems. Supported by a grant from the 
Rockefeller Foundation, Arts PROPEL seeks to 
create a closer link between instruction and assess- 
ment in three areas of the middle and secondary 
school curriculum: visual arts, music, and imagina- 
tive writing.⁴⁹ The primary purpose of the assess- 
ment is not for selection, prediction, or as an 
institutional measure of achievement. Instead, it is 
focused on understanding individual student learn- 
ing as a way of improving classroom instruction. 
The goal is to create assessments that provide a 
learning profile of the individual on as many 
dimensions as possible, as well as showing student 
change over time.⁵⁰ The two sources of assessment 
are portfolios and what is called the "domain 
project," an instructional sequence that focuses on 
central aspects of a domain and provides opportuni- 
ties for multiple observations of the student. Domain 
projects function as self-contained instructional 
units central to the arts curriculum, and are graded by 
the classroom teacher. 

The portfolio is the central defining element in 
Arts PROPEL. It is intended to be a complete 
process-tracking record of each student's attempts to 
realize a work of art, music, or writing. It also serves 
as a basis for students' reflection about their work, 
a means for them to identify what they value in 
selecting pieces for inclusion, and a vehicle for 
conversations about that work with teachers. A 
typical portfolio might contain initial sketches, 
drafts, or audiotapes; self criticisms and those of 
teachers and other students; successive drafts and 
reflections; and examples of works of others that 
have influenced the student. A final evaluation by 
the student and others is included, along with plans 
for successive work. Researchers and school district 
personnel are attempting to find methods of assess- 
ing artistic growth and of conveying this information 
effectively — through scores or other summary indi- 
cators — to administrators, college admissions offi- 
cers, and others. 

Like writing assessments, the use of portfolios is 
not new. For 19 years it has been the major 
component of the Advanced Placement (AP) studio 
art examination, administered by ETS⁵¹ (see box 
7-H). 



⁴⁷ From Dennie Palmer Wolf, "Portfolio Assessment: Sampling Student Work," Educational Leadership, vol. 46, No. 7, April 1989, pp. 35-36. 

⁴⁸ OTA data, 1991. 

⁴⁹ Roberta Camp, "Presentation on Arts PROPEL Portfolio Explorations," paper presented at the Educational Testing Seminar on Alternatives to 
Multiple-Choice Assessment, Washington, DC, Mar. 30, 1990, p. 1. 

⁵⁰ Drew H. Gitomer, "Assessing Artistic Learning Using Domain Projects," paper presented at the annual meeting of the American Educational 
Research Association, New Orleans, LA, April 1988, p. 4. 

⁵¹ Mitchell and Stempel, op. cit., footnote 2. 

Art credit: Dennis Biggs, gmch ft Pittsburgh Public Schools 

Portfolios of student work provide an ongoing record of progress 
and the development of skills. These pictures were drawn by a 
student in Pittsburgh's Arts PROPEL program. Each portrait was 
completed in 3 minutes using a black felt-tip pen. The first is a 
contour drawing of a classmate, the second is a portrait of the 
same student using all circular lines, and the third is the same 
student using only lines drawn with a ruler. 



Common Characteristics of 
Performance Assessment 



Although there is great variety in the kinds of 
measures that fall under the umbrella of performance 
assessment, certain common characteristics distin- 
guish their use and implementation in school sys- 
tems. 

Performance tests require student-constructed 
responses as opposed to student-selected responses. 
While it is not certain that these two responses 
involve different cognitive processes, creating a 

Box 7-F—"This is My Best": Vermont's Portfolio Assessment Project 

Prior to 1990, Vermont was one of the few States with no mandated statewide testing program. Districts could 
conduct standardized norm-referenced testing for their [...] 
legislature approved funds for a statewide assessment [...] 
piece of the plan, piloted in the 1990-91 school year in one-quarter of [...] 
and mathematics in grades four and eight. Eventually all the major academic disciplines will be covered. Each 
assessment has three parts: a uniform test, "best pieces" exemplifying the student's highest achievement in the 
judgment of the student and teacher, and a portfolio showing development throughout the year. 

The mathematics assessment includes a standardized test that contains multiple-choice, open-ended, and 
longer computational problems. Each student is also responsible for assembling a mathematics portfolio, a 
collection of some 10 to 20 entries of problems and projects completed. Five to seven of these are pieces the student 
and teachers have chosen as best pieces, accompanied by a letter the student writes to the evaluator [...] 
these were selected. All this conferring, questioning, reviewing, and writing about mathematics is aimed at better 
understanding and communication about mathematical reasoning, logic, and problem solving. The mathematics 
portfolios are designed to foster an attitude of responsibility for learning on the part of the student, reveal the 
student's feelings about mathematics, and provide a means of showing growth in areas not well suited to 
standardized tests¹ (see figure 7-F1). 

The writing assessment is made up of a uniform writing prompt and an interdisciplinary writing portfolio.² The 
writing assessment is similar to that used in other States, with students given a [...] 
The students are encouraged to think through ideas first and write rough drafts, [...] 
provided in the testing room, and then produce a finished product. The prompt used for the 19[...] 

Most people have strong feelings about something that happen[...] 
felt happy, scared, surprised, or proud. Tell about this time so that the reader will understand what happened, who 
was involved, how the experience made you feel, and why it was important [...] 

Students also answered 12 general information questions that accompanied the writing assessment. Their 
responses were correlated to levels of writing performance and illuminated several issues the State found important. 
These included: the negative impacts of television viewing, positive effects of reading, and support for teaching of 
writing as a process and writing across the curriculum. The analysis was conducted by [...] 
responsible for scoring the uniform writing assessments. 

The writing portfolio can contain pieces from grades prior to the fourth and eighth grade "snapshots" [...] 
works in various stages of revision; several other writing samples, including a poem, short story, play, or personal 
narration; a personal response to an event, exhibit, book, issue, mathematics problem, or scientific phenomenon; 
and prose pieces from any curricular area outside of English. As in the mathematics portfolio, the student also 
chooses one best piece, and writes a letter to the evaluators explaining why the piece was selected and the process 
of its composition. 

The writing portfolios are scored by teachers. In the pilot year, approximately 150 fourth and eighth grade 
teachers from the sample schools did the scoring. Each portfolio and best piece was assessed by two teachers (using 
the writing benchmarks shown in table 7-F1) and the process took 2 days. Although it was an intense experience, 
the teachers' reactions were generally positive: 

. . . despite the work load, this was an invigorating and inspiring couple of days. A few things impressed me: the 
uniformity of the grading; the joy of discovering various "nuggets" of good stuff; the variety and the quality of eighth 
grade writing. 

I learned a hell of a lot. The experience confirmed the prevailing sense among the writing community that language 
can be the close, personal ally of every self, regardless of ability, age, or station. 

What was most useful about this process was that teachers from all over the state saw the variety and talked about it.⁴ 



¹ Vermont Department of Education, Looking Beyond "The Answer": Vermont's Mathematics Portfolio Assessment Program, Pilot Year 
Report 1990-91 (Montpelier, VT: 1991). 

² See Vermont Department of Education, "This Is My Best": Vermont's Writing Assessment Program, Pilot Year Report 1990-91 
(Montpelier, VT: 1991), p. 7. 

³ Ibid., p. 19. 

⁴ Ibid., pp. 13-14. 

Figure 7-F1—Portfolios as a "Window" on Student Feelings About Mathematics 

Students keep copies of their mathematics problems as well as [...] 
mathematics in their portfolios. This student's current frustration is reflected in his entry: 

[Handwritten student entry] 

Later in the year, he was faced with the following [...] 

[Word problem about cows and chickens] 

What follows is his solution, and his reaction, in what he called his "opinion corner": 

[Handwritten solution and "Opinion Corner" entry] 

SOURCE: Vermont Department of Education, Looking Beyond "The Answer": Vermont's Mathematics Portfolio 
Assessment Program, Pilot Year Report 1990-91 (Montpelier, VT: 1991), p. 31. 

Continued on next page 

Box 7-F—"This is My Best": Vermont's Portfolio Assessment Project—Continued 

In 1991-92, all Vermont schools are required to 
use the assessments in the target grades. Local teachers 
will assess the writing portfolios in their own schools, 
after a series of professional development sessions. 
They have the option of working alone, assessing only 
their own students' portfolios, or working coopera- 
tively with other teachers in their schools. In late spring 
they will bring a sample of five portfolios to a regional 
meeting, where teachers from other schools will score 
their sample portfolios to determine a rate of reliabil- 
ity. A sample of portfolios from each regional meeting 
will be assessed at a statewide meeting to ensure that 
common standards are applied statewide.⁵ Aware of 
the importance of training teachers to use new 
assessment tools as levers for instructional change, the 
State has committed 40 percent of the assessment 
budget to professional development.⁶ 

The reporting system has also been carefully 
considered. Building on Vermont's tradition of town 
meetings, each district declares an annual Vermont 
School Report Day each spring. At this time commu- 
nity members and the press go to their schools for an 
analysis of assessment results and to discuss the 
district's response to a list of questions prepared by the 
State board to encourage discussion about local 
schooling goals and successes. 

⁵ Ibid., p. 8. 

⁶ Ross Brewer, presentation at "Educational Assessment for 
the Twenty-First Century [...]" 
National Center for Research on Evaluation, Standards and Student 
Testing, Manhattan Beach, CA, Mar. 9, 1991. 

Table 7-F1—Vermont Writing Assessment Analytic Assessment Guide 

Five dimensions of writing are rated on the following levels of 
performance: extensively, frequently, sometimes, rarely (criteria 
for each of these are listed). 

Purpose: The degree to which the writer's response: 
• establishes and maintains a clear purpose; 
• demonstrates an awareness of audience and task; 
• exhibits clarity of ideas. 

Organization: The degree to which the writer's response illustrates: 
• unity; 
• coherence. 

Details: The degree to which the details are appropriate for the 
writer's purpose and support the main point(s) of the writer's response. 

Voice/tone: The degree to which the writer's response reflects 
personal investment and expression. 

Usage, mechanics, grammar: The degree to which the writer's response 
exhibits correct: 
• usage (e.g., tense formation, agreement, word choice); 
• mechanics: spelling, capitalization, punctuation; 
• grammar; 
• sentences; 
as appropriate to the piece and grade level. 

SOURCE: Vermont State Board of Education, "This Is My Best": Vermont's 
Writing Assessment Program, Pilot Year Report 1990-91 (Montpelier, VT: 1991), p. [...] 



response may more closely approximate the real- 
world process of solving problems. Most perform- 
ance tasks require the student to engage in a complex 
group of judgments; the student must analyze the 
problem, define various options to solve the prob- 
lem, and communicate the solution in written, oral, 
or other forms. Furthermore, often a solution re- 
quires balancing "tradeoffs" that can only be 
understood when the person making the choices 
explains or demonstrates the rationale for the choice. 
Performance assessment tasks make it possible to 
trace the path a student has taken in arriving at the 
chosen solution or decision. 

Performance assessment attempts as much as 
possible to assess desired behavior directly, in the 
context in which this behavior is used. Tasks chosen 
for testing must sample representatively from the 
desirable skills and understandings: demonstrating 
ability to write a persuasive argument might be 
reflected in asking students to write a paragraph 
convincing the teacher why an extension is needed 
on an assignment; demonstrating an understanding 
of experimental design might involve designing and 
conducting an experiment to find out if sow bugs 
prefer light over dark environments; showing one's 
facility with the French written language might 
involve translating a French poem into English. In 
each of these cases, it is possible to conduct other 
kinds of tests that can accompany the performance 
task (e.g., vocabulary tests, lists of procedures, 

Box 7-G—Michigan's Employability Skills Assessment Program 

In an effort to ensure that Michigan's high school graduates acquire skills necessary to remain competitive in 
an increasingly technological workplace, the Governor's Commission on Jobs and Economic Development 
convened the Employability Skills Task Force in 1987. The Task Force, made up of leaders from business, labor, 
and education, was charged with identifying the skills Michigan employers believe important to succeed in the 
modern workplace. The Task Force concluded that Michigan workers need skills in three areas: 

• Academic skills, such as the ability to read and understand written materials, charts, and graphs; 

• Teamwork skills, such as the ability to express ideas to colleagues of a team and compromise to accomplish 
a goal; and 

• Personal Management skills, such as the ability to meet deadlines and pay attention to details.¹ 

The Task Force also served as a policy advisory group on the development of Michigan's Employability Skills 
Assessment Program for the State's high schools. The Task Force concluded that student portfolios would best 
describe the strengths and weaknesses of individual students in the skill groups, and could serve as the basis for 
planning an individual skills development program for each student. 

The portfolio program was piloted during the 1990-91 school year in 22 school districts. Districts were 
encouraged to apply the program to a cross section of students in order to emphasize that the program was designed 
for everyone, not just noncollege-bound youth. 

To help students, the State provided several tools including three portfolios (one in each skill area), a portfolio 
information guide for the student, a parent guide for the student's parents, a personal rating form to be filled out 
by students, teachers, and parents, and a work appraisal form for employers to complete. 

Each of the three portfolios, Academic, Teamwork, and Personal Management, stresses skills considered 
important in that particular area. Students are responsible for updating their portfolios with sample work and 
information about grades, awards, and recommendations. For example, the captain of the school track team might 
ask her coach for a letter of recommendation to place in her Teamwork portfolio as proof of her leadership ability. 
If students feel they are lacking in a particular skill category, they can seek out an activity designed to help them 
master that skill. In this way students are expected to discover, develop, and document their "employability skills." 
It is envisioned that the portfolios will serve as "resume builders."² When applying for jobs, students will use their 
portfolios to demonstrate employability skills. 

It is difficult to assess the results of the Employability Skills Assessment since the program is so new. The few 
collected responses have been mixed. Schools that have taken the program to heart, contacting local businesses and 
informing them of the program, have been enthusiastic. Some schools have even invited local business managers 
to assess individual students' portfolios. Other schools, however, have been less satisfied. Some are resisting 
suggested changes because they appear incompatible with other reform efforts; others are hesitant to involve 
business in what is viewed primarily as the job of the schools. Michigan law now requires every school to design 
a portfolio system to assess ninth graders beginning in the 1992-93 school year. The State's Department of 
Education plans to continue piloting the Employability program. 

¹ [...] similar emphasis on the [...] of acad[...] 
U.S. Department of Labor, Secretary's Commission on Achieving Necessary Skills (SCANS), What Work Requires of Schools (Washington, 
DC: June 1991). 

² Edward D. Roeber, Michigan Department of Education, personal communication, Oct. 22, 1991. 



questions about content), but in performance assess- 
ment direct performances of desired tasks are 
evaluated. 

Performance assessments focus on the process 
and the quality of a product or a performance. 
Effectiveness and craftsmanship are important ele- 
ments of the assessment; getting the "right answer" 
is not the only criterion.⁵² The process as well as the 
results are examined in solving a geometric proof, 
improving one's programming skills, or formulating 
a scientific hypothesis and testing it. 



⁵² Grant Wiggins, "Authentic Assessment: Principles, Policy and Provocations," paper presented at the Boulder Conference of State Testing Directors, 
Boulder, CO, June 1990. 

Box 7-H—Advanced Placement Studio Art Portfolios¹ 

While the idea of portfolios for large-scale testing is considered a novel idea, portfolios have been the heart 
of the Advanced Placement (AP) examination for studio art [...] 
Portfolio Evaluation is to certify that a high school student has produced works that meet the achievement level 
expected of first year college students in studio art. The cost to the student is $65, the same as for other AP 
examinations. There are several points that make the assessment of particular interest: 

• The assessment is conducted entirely through evaluation of the work contained in the student portfolio. 
There are no essays, no questions to answer, no standard paper-and-pencil examination. 

• It is considered a "high-stakes" assessment, for, like all AP examinations, students must receive a passing 
grade (a score of 3 or higher on a 1 to 5 ranking) to earn [...] 

• Despite the fact that the topic is a "subjective" one like art, [...] 
conducted in an objective manner. 

• There is no set curriculum; teachers have great flexibility in their choice of approach, organization, 
assignments, and so forth. 

• A high degree of student initiative and motivation is required. 

• The program has won the respect of teachers and students at both the high school and college level and there 
is little controversy surrounding it. 

Standardization of Portfolio Submissions 

Students submit a portfolio based on the work they have created during the year-long AP studio art course.⁴ 
A student can choose one of two evaluations: the drawing portfolio or general portfolio evaluation. In the drawing 
portfolio, there must be six original works no larger than 16 inches by 20 inches, and from 14 to 20 slides on an 
area of special concentration. The concentration is a single theme (e.g., self portraiture) developed by the student. 
Some of the concentrations chosen as exemplary in recent years have included cubist still-life drawings, 
manipulated photographs, wood relief sculptures, still lifes transformed into surreal landscapes, and expressionist 
drawings that serve as social commentary.⁵ Another 14 to 20 slides illustrate breadth. The general portfolio is set 
up in much the same format.⁶ Film and videotapes may be submitted in the concentration section. 

Standardizing Artistic Judgment 

In June 1991, nearly 5,000 portfolios were submitted for the evaluation. These were graded by a panel of 21 
readers (scorers) assembled at Trenton State College in Trenton, New Jersey. The readers all teach either AP studio 
art or analogous college courses; scoring took 6 days. 

Each grading session began with a standard-setting session. A number of portfolios were presented to the 
assembled readers, roughly illustrating all the possible scores. These examples were chosen beforehand by the chief 
reader for the whole evaluation and the table leader for each section; their selection and judgment were guided by 
their experience of teaching. There was no general scoring rubric per se; no analytic scales of primary traits as there 
are in the evaluation of writing. As one former chief reader suggested: 



¹ Much of this discussion comes from Ruth Mitchell and Amy Stempel, Council for Basic Education, "Six Case Studies of Performance 
Assessment," OTA contractor report, March 1991. 

² Studio art was added to the Advanced Placement (AP) program in two [...] 
in 1980. A separate AP art history course is also offered; its examination has a more typical format of multiple-choice and free-response items. 

³ Colleges have varying policies regarding AP credits. Some grant exemption from freshman-level courses, while others require students 
to take the introductory courses, but grant a certain number of elective credits. In general, students can reduce the number of courses required 
to graduate from college by passing these AP college-level courses in high school. Thus there is a strong financial incentive to succeed on the 
AP examination. 

⁴ Not all schools offer a separate AP course. A separate AP studio art course is "almos[...] 
AP students work alongside other students in regular classes, while other students submit work done independently during the summer or in 
museum courses. Alice Sims-Gunzenhauser, Educational Testing Service consultant, AP studio art, personal communication, November 1991. 

⁶ Only four works are required in the original work portion. The breadth section specifies that eight slides illustrate drawing skill, with 
four each in three other categories (color, design, and sculpture). 

Factors that are included in assessing quality include imagination; freshness of conception and interpretation; mastery 
of concepts, composition, materials, and techniques; a distinct sense of order and form; evidence of a range of 
experience; and, finally, awareness of art-historical sources, including contemporary artists and art movements. It is 
not expected that every student's portfolio will reflect all of these considerations to [...] . . . What you're 
really after is a mind at work, an interested, live, thinking being. You [...] 
from long experience and you intuit it.⁷ 

In commenting on how this approach related to judgments in other disciplines, he noted: 

There are more things that join us together than separate us. You can make those judgments as accurately as 
you can in mathematics or in writing or in any other subjects. These other subjects frequently have much more 
difficulty than we do in the visual arts in agreeing on standards. ... You get a sense for copied work, a sense 
there's engagement, when inspiration, belief, direct involvement are present or absent.⁸ 

The portfolios chosen to exemplify each grade remained on display throughout the scoring as references for 
comparison. The readers assigned scores to each part separately, on a scale from 1 to 4. Originality of work was 
scored independently by three readers; concentrations and breadth by two readers. The scores were manipulated by 
computer to arrive at a raw score (1 to approximately 100) to which the three sections (original work, concentration, 
and breadth) contribute equally. If discrepancies of 2 or more points between two readers' evaluations of the same 
section occurred, the chief reader reviewed the section and reconciled the scores. The chief reader might speak with 
a reader and use the models to reinforce the agreed standard. 
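
The scoring steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text gives the 1-to-4 reader scale, the number of readers per section, the equal weighting of the three sections, and the 2-point discrepancy rule, but it does not give the formula the computer used. The rescaling in `raw_score`, the function names, and the sample scores are all hypothetical.

```python
from itertools import combinations
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 4  # reader scale stated in the text


def needs_chief_review(scores, threshold=2):
    """Return True if any two readers differ by `threshold` or more points."""
    return any(abs(a - b) >= threshold for a, b in combinations(scores, 2))


def raw_score(sections, top=100):
    """Hypothetical composite (the actual formula is not given in the text):
    each section's mean reader score is rescaled to the range 0..1, and the
    sections are then averaged so that each contributes equally."""
    fractions = [
        (mean(scores) - SCALE_MIN) / (SCALE_MAX - SCALE_MIN)
        for scores in sections.values()
    ]
    return round(top * mean(fractions))


# Made-up reader scores for one portfolio.
portfolio = {
    "original_work": [3, 4, 3],  # three readers
    "concentration": [2, 4],     # two readers, 2 points apart
    "breadth": [3, 3],           # two readers
}

# Sections that would go to the chief reader for reconciliation.
flagged = [name for name, s in portfolio.items() if needs_chief_review(s)]
print(flagged, raw_score(portfolio))
```

Averaging within a section before combining is one simple way to keep a section scored by three readers from outweighing one scored by two; other weightings would satisfy the same description.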

After all portfolios had been evaluated, cutoff scores were determined and the total scores then converted to 
the AP grades on a scale of a high of 5 to a low of 1. Although assigning the cutoff scores (i.e., determining the lowest 
total score to receive an AP grade of 5 on down) is the chief reader's responsibility, there was input from a long 
debriefing meeting of all readers and from statistical information supplied by the computer, historical data regarding 
previous years' cutoff scores, composite and raw scores of the present year's candidates, and tables showing the 
consequences of choosing certain cutoff scores, in terms of percentage of students receiving 5, 4, 3, and so on. The 
scores overall were roughly distributed in a bell curve, with most receiving a 3, but fewer 1s than 5s. (Colleges do 
not usually accept either 2 or 1 scores, so a 2 can perform the same function as a 1 (i.e., denying the awarding of 
college credit) without making such a negative judgment of a student's work.) 
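
The conversion from raw scores to AP grades, and the kind of table showing the consequences of a given set of cutoffs, might look like the following Python sketch. The cutoff values and raw scores are invented for illustration; the text says only that cutoffs are the lowest total score to receive each grade and that the chief reader sets them.

```python
from bisect import bisect_right
from collections import Counter


def ap_grade(raw, cutoffs):
    """Map a raw score to an AP grade from 1 to 5.

    `cutoffs` lists, in ascending order, the lowest raw score that earns
    grades 2, 3, 4, and 5 (hypothetical values in this sketch)."""
    return 1 + bisect_right(cutoffs, raw)


def grade_percentages(raw_scores, cutoffs):
    """Percentage of candidates receiving each grade, listed from 5 down to 1."""
    counts = Counter(ap_grade(r, cutoffs) for r in raw_scores)
    n = len(raw_scores)
    return {g: 100 * counts.get(g, 0) / n for g in range(5, 0, -1)}


scores = [22, 35, 41, 48, 52, 55, 58, 63, 71, 86]  # made-up raw scores
cutoffs = [30, 45, 60, 80]                         # hypothetical cutoffs
table = grade_percentages(scores, cutoffs)
print(table)
```

Rerunning `grade_percentages` with different candidate cutoffs gives the kind of comparison the text describes, showing how each choice shifts the share of students at each grade.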

Impacts on Students 

In the process of creating portfolios for AP studio art, students begin to develop artistic judgment about their 
own work and that of their fellow students. Students are taught to criticize each other's work constructively. As they 
learn how to select works for their own portfolios, they also learn to communicate with each other about areas 
that need improvement. This climate of reflection is an important byproduct of portfolio assembly. 

Another key factor is motivation. As one teacher suggested, the course is a test of students' self motivation.⁹ 
For example, students must have the ability to envision a concentration project and then work steadily toward 
completing it for 8 or 9 months, solving problems as they arise. The work on all three sections must be timed so 
that the entire portfolio is ready at the deadline. Pieces have to be photographed for slides and final selections made 
for the collection of original works. 

Broad Public Acceptance 

Another important point is the relative lack of controversy surrounding judgment of a subject traditionally 
considered subjective. This respect comes from the long history of the evaluation and the refinements the 
Educational Testing Service has made to the jury method of judging works of art, based on collective, but 
independent, judgments by teachers who are involved in the day-to-day teaching of students like those being 
assessed. These teachers are well trained in the objectives of the course as well as the performance standards for 
each level, and their judgment is valued and respected. 



⁷ Walter Askin, Evaluating the Advanced Placement Portfolio in Studio Art (Princeton, NJ: Advanced Placement Program, 1985), p. 25. 
⁹ Raymond Campeau, AP studio art teacher in Bozeman, MT, in Mitchell and Stempel, op. cit., footnote 1. 

The product or record of a performance assess- 
ment is scored by teachers or other qualified judges. 
In classroom testing this observation is done by the 
teacher, but in large-scale assessments, products, 
portfolios, or other records of work are scored by 
teams of readers. How much psychometric rigor is 
required in making these qualitative and complex 
judgments varies with the purpose of the assess- 
ment; less rigor is acceptable for use within the 
classroom for diagnostic purposes than would be 
acceptable in large-scale testing programs where 
comparability is essential. What is important is 
that performance tests are not "beyond stand- 
ardized testing"; they should be standardized 
whenever comparability is required.⁵³ 

The criteria for judging performance assess- 
ments are clear to those being judged. Criteria for 
judging successful performance must be available 
and understood by teachers and students. The tasks 
and standards must allow for thorough preparation 
and self-assessment by the student,^ if the test is to
be successful in motivating and directing learning,
and in helping teachers to successfully guide
practice.^ The goal in performance assessment is to
provide tasks that are known to the student:
activities that not only can but should be practiced.
Performance assessment tasks are intended to be
"taught to," integrating curriculum and assessment
into a seamless web. Practice required for good 
performance is understood to increase and stimulate 
learning. 

Performance assessment may take place at one 
point or over time. Typically it examines patterns of
student work and consistency of performance, look- 
ing at how an individual student progresses and 
develops. This is particularly true of portfolios, 
which are collections of student work over time. 

While multiple-choice and other paper-and-pencil 
examinations are almost exclusively taken by an 
individual student, some performance assessments 
can be and are often conducted as group activities. 
This group activity reflects increasing interest in 
student team work and cooperation in solving tasks 
as a valued outcome of the educational process. 
Proponents suggest that, if teamwork is a valued
skill, it should be assessed. However, the problems
associated with inferring individual effort, ability,
and achievement from group performances are
significant. Individual performance and performance
as a member of a group are often scored as two
separate pieces of the assessment.

Performance assessments are generally criterion- 
referenced, rather than norm-referenced. Al- 
though it is important to collect information on how 
a wide range of students respond to performance 
assessment tasks, the primary focus is on scoring
students relative to standards of competence and
mastery. Developers of performance assessment are
seeking test-based indicators that portray individual
performance with respect to specific educational
goals rather than those that simply compare an
individual's performance to a sample of other test
takers.

Performance Assessment Abroad 

The standardized, machine-scored, norm- 
referenced, multiple-choice tests so common in this 
country for large-scale testing are rarely used in 
other countries. In fact, these are often referred to 
generically as "American tests." Instead, examinations
like the French Bac, the German Abitur, or the
English General Certificate of Secondary Education
or "A levels," generally require students to create
rather than select answers, usually in the format of 
short-answer or longer essay questions or, in some 
cases, oral examinations. These examinations share 
several of the characteristics noted above regarding 
performance assessments in American schools: they 
are typically graded by teachers, the content is based 
on a common curriculum or syllabus for which 
students prepare and practice, and the questions are 
made public at the end of the examination period. 

It is important to note, however, as discussed in 
chapter 4, that these tests are most commonly used 
for selection of students into postsecondary educa- 
tion rather than for classroom diagnosis or school 
accountability. Consequently, several of the charac- 
teristics noted in American performance assess- 
ments are not present in these examinations. That is, 
the examinations are usually individual assessments,
with no opportunity for group activities; they
do not involve self-assessment or student involvement
in evaluation; the examinations are timed
rather than open-ended; and, even when administered
over several days, they do not involve tasks
that take several testing periods or longer time
periods to complete.

^Frederick L. Finch, "Toward a Definition for Educational Performance Assessment," paper presented at the ERIC/PDK Symposium, Alternative
Assessment of Performance in the Language Arts, Bloomington, IN, Aug. 23, 1990.

^Wiggins, op. cit., footnote 52.

^Center for Children and Technology, op. cit., footnote 5, p. 3.

Nevertheless, European experience can be informative.
For example, the national assessment in
Holland structures performance-based assessments
for students by designing comprehensive problems
for the year-end examinations. A committee of
teachers in art history, for example, selects a
unifying subject (e.g., "revolution"). Students are
provided with information packages to guide their
study of art throughout the year in ways that help
them to critically develop the theme (e.g., readings
and lists of museums). Teachers are encouraged to
work with students to help them develop individual
interpretations and points of view. This assessment
approach supports students in doing individualized
in-depth work in a context of shared ideas, procedures,
and problems.^

The United Kingdom is the furthest along of 
European countries using performance assessment 
for national testing. The Education Reform Act of
1988 set in place a national curriculum, which has at 
its core a set of attainment targets for each of the 10 
foundation subjects to be taught to all students. 
These statements of attainment provide the basis for 
the criterion-referenced assessment system. Teachers
have been given detailed, clearly defined Standard
Assessment Tasks (SATs) to use with all
students at or near the completion of four levels or
"key stages" of schooling: ages 7, 11, 14, and 16.
Each SAT carries with it levels of attainment and the 
tasks for determining levels, described in manuals 
provided to all teachers. The tasks involve one or 
more components of every aspect of performance: 
reading, writing, speaking, listening, investigating, 
demonstrating, drawing, experimenting, showing, 
and assembling. The tasks were developed through 
research conducted at schools across the United 
Kingdom by the National Foundation for Educational
Research in England and Wales. 

Following a 2-day teacher training period, three
sets of SATs were piloted in May 1990, testing 6,219
students in level one (age 7) in schools throughout
England and Wales. Each was constructed around an
overall theme hoped to engage the interest of
7-year-olds: Toys and Games, Myself, and The
World About Me.

Evaluation data and recommendations reflect 
widespread concern with the extremely detailed and
directive nature of the assessment system: 

In view of the issue of time and workload ... an
inescapable conclusion must be that future SATs
should be significantly shorter than those piloted.
SEAC [School Examinations and Assessment Council]
are likely to recommend that the SAT is to be
carried out in a three week period, and to take not
more than half the teacher's time during those three
weeks. . . . The number of activities that can be fitted
in will need to be reduced to about six in order to be
sure that these time constraints can be observed. . . .
The model of a SAT covering all or most, or even
half of the ATs has now been proven to be
unworkable in light of the number and nature of ATs
included in the final statutory orders. . . . The SAT
should still offer teachers the opportunity to embed
the assessments within a coherent cross-curricular
theme.^

How far the United Kingdom will be able to move 
forward on this ambitious assessment plan that 
requires so much teacher time is still under debate. 
However, the close tie to the national curriculum 
strengthens the likelihood that the SATs will be 
maintained as centerpieces for assessment. 

Finally, some countries are experimenting with 
the use of portfolios for large-scale testing activities, 
and many are looking to the United States for 
guidance in this field. Because the United States is 
widely respected as a leader in psychometric design, 
many other countries are watching with interest how 
we match psychometric rigor to the development of 
performance assessment techniques. 

Policy Issues in Performance 
Assessment 

Various direct methods of assessing performance
have long been used by teachers as a basis for
making judgments about student achievement within
the classroom. Teachers often understand intuitively
their own potential for errors in judgment and the
ways in which student performance can vary from
day to day. As a result they use daily and repeated
observations over time to formulate judgments and
shape instruction. An error in judgment on one day
can be corrected or supplanted by new observations
the next.

^Center for Children and Technology, op. cit., footnote 5, p. 8.

^National Foundation for Educational Research/Bishop Grosseteste College, Lincoln Consortium, The Pilot Study of Standard Assessment Tasks for
Key Stage 1--Part 1: Main Text & Comparability Studies (Berkshire, England: March 1991), p. 10, emphasis added.

The stakes are raised when testing is used for 
comparisons across children, classrooms, or
schools, and when test results inform important
decisions. As noted by several experts in test design
and policy:

... when direct measures of performance take on
an assessment role beyond the confines of the
classroom -- portfolios passed on to next year's
teacher, district wide science laboratory tasks for
program evaluation, or state-mandated writing
assessments for accountability are just a few examples --
whatever contextual understanding of their fallibility
may have existed in the classroom is gone. In such
situations, a performance assessment, like any other
measurement device, requires enough consistency to
justify the broader inferences about performance
beyond the classroom that are likely to be based on
it. Most large-scale performance assessments are
being proposed today for fundamentally different
purposes from those of classroom measurement,
such as monitoring system performance, program
and/or teacher evaluation, accountability and broadly
defined educational reform. Even though none of
these uses typically involves scores for and decisions
about individual students, each is a high stakes
application of an educational measurement to the
extent that it can effect a wholesale change in a
school program affecting all students.^

The feasibility and acceptance of the widespread 
use of performance assessment by policymakers 
must rest on consideration of a number of important 
issues. In addition, the purpose of a particular test 
will, in large part, determine the relative importance 
or weight that should be given to each of these 
issues. 

Standardization of Scoring Judgments 

One of the first concerns about the applicability of 
performance assessment to large-scale testing is the 
extent to which human judgment is required in
scoring. Variability across judges and potential for
bias in scoring could create impediments to using
these methods for high-stakes testing. For scores to
yield meaningful inferences or comparisons, they
must be consistent and comparable. A student's
score should reflect his or her level of achievement,
and should not vary as a function of who is doing the 
judging. A key feature of performance assessment is 
the complexity of judgment needed for scoring; 
however, this very complexity, some suggest, may 
be a barrier to its widespread implementation in 
situations where comparability matters. 

For performance assessment to fulfill its promise,
it must meet challenges regarding reasonable stand- 
ards for reliable scoring, whether this scoring is done 
by individuals, teams, or by machines programmed 
to simulate human judgment. This is an area where 
test publishers have experience and expertise to offer 
school districts and States considering performance 
assessments. As noted above, Arizona has hired the 
Riverside Publishing Co., in part because of experience
with the Arizona educators and their curriculum
and past testing activities (the Iowa Tests of
Basic Skills and the Tests of Achievement and
Proficiency programs), but also because the publishers
claim expertise in field testing items or tasks and
providing scales that meet previous standards for 
reliability. 

Because there has been considerable research by
curriculum experts and the research community on
developing and scoring essays and writing assessments,
they present a model that students, teachers,
and the general public can appreciate. Scoring has
been made more systematic and reliable by a number
of procedures. Scoring criteria are carefully written
to indicate what constitutes good and poor performance;
representative student papers are then selected
to exemplify the different score levels. Panels of
readers or scorers are carefully trained until they
learn to apply the scoring criteria in a manner
consistent with other readers. In most large-scale
writing assessments, each essay is read by two
readers. When significant scoring discrepancies
occur, a third reader (often the "team leader") reads
and scores the essay. Various scoring systems can be
employed from holistic (a single score is given for
the quality of the writing) to more fine-grained
analytic scores (each essay is rated on multiple
criteria). Table 7-1 presents an example of one
analytic scoring system that focuses on rating five
aspects of the student's writing: organization,
sentence structure, use of language, mechanics, and
format.

Table 7-1--Criteria for Analytical Scoring

Scale: 1, 3, 5

Organization:
  1  Little or nothing is written. The essay is disorganized, incoherent, and poorly developed. The essay does not stay on the topic.
  3  The essay is not complete. It lacks an introduction, well-developed body, or conclusion. The coherence and sequence are attempted, but not adequate.
  5  The essay is well-organized. It contains an introductory, supporting, and concluding paragraph. The essay is coherent, ordered logically, and fully developed.

Sentence structure:
  1  The student writes frequent run-ons or fragments.
  3  The student makes occasional errors in sentence structure. Little variety in sentence length or structure exists.
  5  The sentences are complete and varied in length and structure.

Usage:
  1  The student makes frequent errors in word choice and agreement.
  3  The student makes occasional errors in word choice or agreement.
  5  The usage is correct. Word choice is appropriate.

Mechanics:
  1  The student makes frequent errors in spelling, punctuation, and capitalization.
  3  The student makes an occasional error in mechanics.
  5  The spelling, capitalization, and punctuation are correct.

Format:
  1  The format is sloppy. There are no margins or indentations. Handwriting is inconsistent.
  3  The handwriting, margins, and indentations have occasional inconsistencies -- no title or inappropriate title.
  5  The format is correct. The title is appropriate. The handwriting, margins, and indentations are consistent.

SOURCE: Adams County School District No. 12, Northglenn, CO.

^Stephen Dunbar, Daniel Koretz, and H.D. Hoover, "Quality Control in the Development and Use of Performance Assessments," paper presented
at the annual meeting of the National Council on Measurement in Education, Chicago, IL, April 1991, p. 1.

In the California Assessment Program's writing
assessments, essays and answers are read by a single
reader, but there are a variety of techniques used to
maintain consistency of grading. Marked papers
already read are circulated back into the pile to see
if they get the same grade again; the table leaders
randomly reread papers to make sure that readers are
consistent; examples of graded papers are kept
available for comparison as anchors.^ Using these
techniques, the inter-rater reliability for the CAP
writing assessment is about 90 percent in a single
year, although less high for the same question across
years. This remains an unsolved problem for CAP
and other States and districts using group grading if
they want to make longitudinal comparisons.^
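
A consistency check of the kind described above, in which papers are read a second time and the two grades are compared, can be sketched in a few lines. This is a minimal illustration only: it assumes the reported figure is a simple percent-agreement statistic (the report does not say which statistic CAP used), and the function name and the sample scores are hypothetical.

```python
def percent_agreement(first_scores, second_scores, tolerance=0):
    """Percentage of papers whose two readings agree.

    `tolerance` is the largest score difference still counted as
    agreement (0 means the two grades must match exactly). Both the
    statistic and the parameter are illustrative assumptions.
    """
    if len(first_scores) != len(second_scores):
        raise ValueError("each paper needs exactly two readings")
    if not first_scores:
        raise ValueError("no papers to compare")
    agreeing = sum(
        1 for a, b in zip(first_scores, second_scores)
        if abs(a - b) <= tolerance
    )
    return 100.0 * agreeing / len(first_scores)


# Hypothetical grades for ten papers that were recirculated and reread.
original_grades = [4, 3, 5, 2, 4, 3, 1, 5, 4, 3]
reread_grades = [4, 3, 5, 2, 3, 3, 1, 5, 4, 3]

print(percent_agreement(original_grades, reread_grades))  # 90.0
```

Exact agreement is the strictest reading of "the same grade again"; setting `tolerance=1` would count adjacent scores as agreement, which yields a higher figure for the same papers.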

Other scoring questions related to design have yet 
to be solved. One of these is the time allotted for 
producing a composition. A 15-minute essay, with 
no chance for revision, may not be a true test of the 
kind of writing that is valued. Thus, testing time
affects how reliably the writing sample reflects
writing skill. Additionally, specifying scoring criteria
and rating scale format are no easy matters.
Although research has recently provided some
empirical analysis of the features of writing that 
distinguish skilled from unskilled writing, some 
suggest that the criteria applied to a particular 
assessment may represent arbitrary preferences of 
the group designing the scale. It is difficult but 
necessary to come to a consensus on these issues. 




Essays and writing samples can be graded consistently
if teachers are trained to apply scoring criteria based
on common standards. In this example, the Educational
Testing Service has assembled experienced teachers to
read and score essays written by students across the
country on their Advanced Placement examinations.

^Mitchell and Stempel, op. cit., footnote 2.
Policy Implications

Writing assessments, essays, and courses like AP 
studio art have a proven track record of assessing 
performance in a standardized and reliable fashion. 
Whether these same procedures for obtaining con- 
sistency in scoring can be applied to other forms of 
performance assessment (e.g., portfolios, exhibitions,
oral examinations, and experiments) is as yet
largely unexplored. Moreover, although inter-rater
reliability is relatively high (for judging essays), it
still contains some variation that may add error to
scores. What degree of error in measurement is 
acceptable depends, in part, on the purposes of the 
test. Careful development of scoring criteria and 
intensive training of judges are key to establish- 
ing consistency of judgment. 

Generalizability of Scores: Are the Tests 
Valid Estimates of What Students Know?

Most students, current and former, can remember 
taking an essay test and feeling "lucky" because the
questions just happened to hit topics they knew well;
a high score, perhaps higher than their study and
knowledge actually deserved, was the result. More
likely, they remember the time they "bombed" on
a test, unjustly they felt, because the essays covered 
areas they had not understood or studied as well. One 
of the advantages of item-based tests is that a large 
number of items can be given in a limited amount of 
testing time, thereby reducing the effect of a single 
question on the overall score. 

When only a few tasks are used there is a much 
higher risk that a child's score will be associated 
with that particular task and not generalize to the 
whole subject area that the test is meant to cover. 
Writing assessment provides a particularly good 
example of the problem of generalizing results from 
a single question. In many cases a 30-minute essay 
test is given to students in order to estimate 
something about their overall ability to write well. 
However, a number of different kinds of writing
tasks can be given. The National Council of Teachers
of English lists five methods of communication
in writing (narrating, explaining, describing,
reporting, and persuading) that provide the framework
for much of the classroom instruction in
writing.^ When tests are given, the essay question
(or prompt) can be in any of these modes of dis- 
course. 

Two kinds of information are needed to make 
essay test results generalizable. First, would two 
different essays drawn from the same mode of 
discourse result in the same score? Results of several 
studies cited in a recent review suggest that agree- 
ment between two essays written by the same child 
in the same writing mode is not very high (reliability 
scores range from 0.26 to 0.46).^^ Second, are scores 
for essay prompts from different modes of writing 
similar? For example, if a student is asked to write 
a narrative piece, will the score for this prompt be
similar to a score the same child receives for writing 
a persuasive piece? Results of several investigations 
of writing assessments indicate that correlations 
across tasks are low to moderate. 

Other factors such as the topic of the essay, the 
time limit, and handwriting quality have been shown 
to affect scores on essay tests.^^ Preliminary results 
suggest that a number of tasks would need to be 
administered to any given child (and scores aggre- 
gated across tasks) before a sufficiently high level of 
reliability could be achieved to use these tests for 
making decisions about individuals. One investiga- 
tion of these issues has suggested that six essays, 
each scored by at least two readers, would be needed 
to achieve a level of score reliability comparable to 
that of a multiple-choice test.^^ 
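
The arithmetic behind "more tasks are needed" can be illustrated with the Spearman-Brown formula, a standard psychometric result that predicts the reliability of a score aggregated over several parallel tasks. The sketch below is an illustration under stated assumptions, not the method used in the studies cited: the formula treats the essays as parallel measures and ignores reader effects, the 0.80 target is a hypothetical threshold, and only the single-essay values 0.26 and 0.46 come from the text above.

```python
import math


def spearman_brown(single_task_reliability, n_tasks):
    """Predicted reliability of a score aggregated over n parallel tasks."""
    r = single_task_reliability
    return n_tasks * r / (1 + (n_tasks - 1) * r)


def tasks_needed(single_task_reliability, target):
    """Smallest number of parallel tasks predicted to reach `target`."""
    r = single_task_reliability
    if not (0 < r < 1 and 0 < target < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    k = target * (1 - r) / (r * (1 - target))
    # Subtract a tiny epsilon so an exact integer is not rounded up by
    # floating-point noise.
    return math.ceil(k - 1e-9)


# Single-essay reliabilities at the two ends of the range reported above.
for r in (0.26, 0.46):
    print(r, round(spearman_brown(r, 6), 2), tasks_needed(r, 0.80))
# 0.26 0.68 12
# 0.46 0.84 5
```

Under these assumptions six essays would bring a single-essay reliability of 0.46 to roughly 0.84, while the lower end of the reported range would need about twice as many essays to reach the same hypothetical target.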

One of the particular problems faced by perform- 
ance assessment is that of substantiating that similar 
generalizations to the whole domain can be made on 
the basis of a few tasks. Very little research exists 
that can shed light on the extent to which different 
performance assessment tasks intended to assess the
same set of skills produce similar scores. Data from
writing assessments suggest, for example, that a
child who produces a superior essay in one format
may write only a mediocre one on a different day in
a different format.

^A.N. Hieronymus and H.D. Hoover, University of Iowa, Writing: Teachers Guide, Iowa Tests of Basic Skills, Levels 9-14 (Chicago, IL: Riverside
Publishing Co., 1987).

^Dunbar et al., op. cit., footnote 58. See also Peter L. Cooper, "The Assessment of Writing Ability: A Review of Research," GRE Board research report
GREB No. 82-15R (Princeton, NJ: Educational Testing Service, May 1984).

^Cooper, op. cit., footnote 61.

^H.M. Breland, R. Camp, R.J. Jones, M.M. Morris, and D.A. Rock, "Assessing Writing Skill," research monograph No. 11, prepared for the College
Entrance Examination Board, 1987, cited in Wayne Patience and Joan Auchter, "Monitoring Score Scale Stability and Reading Reliability in
Decentralized Large-Scale Essay Scoring Programs," paper presented at the annual meeting of the National Testing Network in Writing, Montreal,
Canada, April 1989.

The issue of generalizability, whether a child's
performance on one or two tasks can fairly represent
what he or she knows in that area, is an important
one that greatly influences the conclusions that can
be made from tests. Establishing generalizability is
particularly critical if a test is going to be used to
make decisions about individual students. Again the
experience of writing assessment offers important 
lessons for other forms of performance assessment: 

It has long been known that neither an objective test 
nor a writing sample is an adequate basis for 
evaluation of an individual student, whether for 
purposes of placement, promotion or graduation. 
[One author] . . . noted that a reliable individual 
evaluation would require a minimum of four writing
samples, rated blindly (i.e., without knowledge of 
the student's identity) by trained evaluators. It is a 
continuing scandal of school testing programs that 
patently inadequate data are used for placement and 
categorization.^ 

Policy Implications 

Issues of task generalizability present an impor- 
tant challenge to policymakers and test developers 
interested in expanding the uses of performance 
assessment. If individual scores are not required, 
however, sampling techniques can mitigate these 
issues. For example, many large-scale assessments 
of writing administer multiple prompts in each mode 
but each individual child only answers one or two of 
a larger number of prompts. The large number of 
children answering any one prompt, however, al- 
lows generalizable inferences to be made within and 
across modes about levels of writing achievement 
for students as a whole. The use of sampling 
techniques can allow policymakers and administrators
to make generalizable inferences about schools
or districts without having to administer prohibitively
long or costly tests to every student (see box
7-1).
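
The sampling idea described above, in which each child answers only one or two of a larger set of prompts while every prompt is still answered by many children, can be sketched as follows. This is a hypothetical illustration rather than the procedure of any particular assessment program: the rotating ("spiraled") assignment, the function names, the six prompt labels, and the 600-student group are all assumptions made for the example.

```python
import random
from collections import Counter


def assign_prompts(student_ids, prompt_ids, per_student=2, seed=0):
    """Give each student `per_student` distinct prompts.

    Students are shuffled (seeded, so the result is repeatable) and the
    prompts are then rotated across the shuffled list so that every
    prompt is answered by about the same number of students.
    """
    if not 1 <= per_student <= len(prompt_ids):
        raise ValueError("per_student must be between 1 and the number of prompts")
    order = list(student_ids)
    random.Random(seed).shuffle(order)
    k = len(prompt_ids)
    return {
        student: [prompt_ids[(i + j) % k] for j in range(per_student)]
        for i, student in enumerate(order)
    }


def responses_per_prompt(assignments):
    """Count how many students answer each prompt."""
    return Counter(p for prompts in assignments.values() for p in prompts)


def group_means(assignments, scores):
    """Average score per prompt; `scores` maps (student, prompt) to a score."""
    by_prompt = {}
    for student, prompts in assignments.items():
        for prompt in prompts:
            by_prompt.setdefault(prompt, []).append(scores[(student, prompt)])
    return {p: sum(v) / len(v) for p, v in by_prompt.items()}


# Hypothetical group: 600 students, six prompts, two prompts per student.
students = [f"student-{i}" for i in range(600)]
prompts = ["narrate-A", "narrate-B", "explain-A",
           "explain-B", "persuade-A", "persuade-B"]
assignments = assign_prompts(students, prompts)

print(set(responses_per_prompt(assignments).values()))  # {200}
```

Each student writes two essays instead of six, yet every prompt collects 200 responses, which is what allows group-level inferences without individual scores on every prompt.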

Costs 

The costs of performance assessment represent a 
substantial barrier to expanded use. Performance
assessment is a labor-intensive and therefore costly
alternative unless it is integrated in the instructional
process. Essays and other performance tasks may
cost less to develop than do multiple-choice items,
but are very costly to score. One estimate puts
scoring a writing assessment at 5 to 10 times more
expensive than scoring a multiple-choice examination,^
while another estimate, based on a review of
several testing programs administered by ETS,
suggests that the cost of assessment via one 20- to
40-minute essay is between 3 and 5 times higher than
assessment by means of a test of 150 to 200
machine-scored, multiple-choice items.^ Among
the factors that influence scoring costs are the length
of time students are given to complete the essay, the
number of readers scoring each essay, qualifications
and location of readers (which affects how much
they are paid, and travel and lodging costs for the
scoring process), and the amount of pretesting
conducted on each prompt or question. The higher
these factors, the higher the ratio of essay to
multiple-choice costs. The volume of essays read at
each scoring session has an inverse effect on
cost: the greater the volume, the lower the per-item
cost.^
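
The cost relationships in this paragraph reduce to simple arithmetic, sketched below. The two per-student scoring prices ($1.20 and $4.22) are the ones quoted in this chapter's footnotes; everything in `per_essay_cost`, including the split into a fixed session cost and a per-reading cost and all of its dollar amounts, is a hypothetical model added only to show why volume lowers the per-essay figure.

```python
def cost_ratio(essay_cost, multiple_choice_cost):
    """How many times more expensive essay scoring is, per student."""
    return essay_cost / multiple_choice_cost


def per_essay_cost(fixed_session_cost, cost_per_reading,
                   readings_per_essay, n_essays):
    """Hypothetical model: fixed session costs are spread over all essays,
    while reader time is paid for every reading of every essay."""
    return fixed_session_cost / n_essays + cost_per_reading * readings_per_essay


# Per-student scoring prices quoted in the footnotes to this chapter.
print(round(cost_ratio(4.22, 1.20), 2))  # 3.52

# Hypothetical figures: a $10,000 scoring session, $1.50 per reading,
# two readers per essay.
print(per_essay_cost(10_000, 1.50, 2, n_essays=1_000))   # 13.0
print(per_essay_cost(10_000, 1.50, 2, n_essays=10_000))  # 4.0
```

The quoted prices give a ratio of about 3.5, inside the 3-to-5 range reported above, and the hypothetical model shows the per-essay cost falling as the fixed costs of a scoring session are spread over more essays.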

Is performance-based assessment worth the sig- 
nificantly higher direct costs of scoring? First, it is 
important to recall that high direct costs may 
overestimate total costs if the indirect costs are not 
taken into account. As explained in chapter 1, 
comparison of two testmg programs on the basis of 
direct costs a' one is deceiving. Because performance 
assessment is intended to be integrated with instruc- 
tion, its advocates argue that it is less costly than it 



^Suhor, op. cit., footnote 40. The author referred to is Paul Diedcrich, Measuring Growth in English (Urbana, IL\ National Council of Ibachers of 
English, 1974). 

^John Prer tr, **What Is So Real About Authentic Assessment?'* paper presented at the Boulder Conference of State Ibsting Directors, Boulder, 
CO, June 10*12, 1990. 

^^The testing programs reviewed included: * * ... the Advanced Placement Program, several essay assessments we operate for the state of California, 
the College Level Examination Program, the Graduate Record Exam, NAEP, the National Ibacher Examination Programs, and tbf > English Composition 
Ibst with Essay of the Admissions Ibsting Program . . .** Penny Engle, Educational Ibsting Service, Washington, DC, personal communication, June 
10 1991. Multiple-choice tests are scored for $1 .20 per student; in contrast, scoring of the Iowa Ibsts of Basic Skills writing test costs $4.22 per student. 
Frederick L. Finch, vice president, The Riverside Publishing Co., personal communication, March 1991. 

^''*'iigle, op. cit., footnote 66. 
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Box 7-1— Assessing Hands-On Science Sicills 

The Natioiial Science Foundation luu mppoited a lesearch project that attempts to explore reliabiUty. 
transfentulity. and validity issues affecting peifoiniance tasks for bnge-scale science assessmeats.^11ie itseaicheis 
fim devek>ped tlnee diffeiem hands-oo labonuoiy task^ 
raperiinmtand manipiilate eqdpnwnt b 

kinds of paper toweb soaked up tiie most water. The second task leqoiied students to figure qut the contents of a 

number of "mystery boxes" contaidng wires, batteries, and^ li^t bulbs. Hie third msffltmait had studentt 

Jtennfaie IJids of environments sow bugs prefiw (e.g., dark or light, d^ 

by experts wUle diqr performed the experiments; the experts scored studenu according to the pioc^ 
as well as die findings of Reinvestigation. 

Evidence about the validity of these measures was obtained by giving the parttoqMting students a traditional 
multiple-choice standardized test of science achievement, in order to compare the scores thqr obtained on their 
hands-on experiments with the scores received on multiple-choice tests. In additkn. the peifbnnance of students 

who had been taught using a hands-on approach to science was compared to those studying under a more ttadit^ 
approach. 

Results provide some encouragement and some warnings. Among die findings of these initial development 
efforts with fifth and sixth graders were the following: 

• Hands-on investigatkxis can be reliabty scored by trained judges. 

• Mormance on any one of the tasks was not his^ related to that on the odieis. A student 

well on one hands-on task and quite poody on another. This suggests that a substantial number of tasks will 
be needed unless matrix samplhig can be used. 

• Hands-on scores were only modeiatety related to student's scores on the traditional muWpte^ 
test, suggesting that diffbient skills are being tapped. 

• Students who had been taught with a hands-on approach did better on diese tasks than did stude^ 
traditional science classroom, suggesting that the tests are sensitive to classroom instniction. 



iRkhaidi J. ShivdMMi. <m P. Baxter. Jmne Pina. end Jeooifor Yure, "New Itelaolotlaa for Lane-Scale Sdeaoo ^^--^-^r 



appears. Resolution of this issue requires agreement 
on the degree to which any given testing options 
under consideration are integrated with regular 
instruction. 

Second, although a performance assessment may 
provide less data than a typical multiple-choice test, 
it can provide richer information that sheds light on 
student capacities not usually accessible from multiple- 
choice tests. Even in an externally scored wri^'ng 
assessment, for example, teachers can gain insight 
in'o students' writing difficulties by looking not just 
at \he raw scores, but at the writing itself. Similarly, 
some outcomes that cannot be measured on multiple- 
choice tests (e.g., ability to work cooperatively in a 
group) can be assessed in performance tasks. 

Finally, many educators maintain that the staff 
development that accompanies performance assess- 
ment is in itself a valuable byproduct. For example, 

ERIC 



when teachers gather to discuss what distinguishes 
a weak piece of writing from an acceptable or an 
exceUent piece of writing, they learn from one 
another and internalize the teaching standards. 

The major problem in approaching an analysis of 
the costs of perfomiance assessment is a lack of a 
common base for the information. >Vhen the Council 
of Chief State School Officers compiled a chart of 
performance assessments in the States in order to 
make comparisons, they asked for reporting under 
the category of "costs." As the data came in, the 
numbers fluctuated dramatically, because different 
respondents thought of costs differently: some 
reported costs of development ($2 million in one 
case), some costs of administration ($5 per sUident), 
and some combined them. In the end, the researchers 
decided to eliminate the question altogether because 
it could provide no meaningful information and
would require extensive explanation no matter what
it included.68

Hands-on assessments like this are costly in time, equipment, and human resources. Because of this, these
investigators also sought "surrogate tasks" that might provide much of the information obtained from hands-on
tasks but at considerably lower cost. To this end they created the following surrogates for the three experiments,
listed in order of "conceptual verisimilitude" (similarity to the hands-on experiments):

• laboratory notebooks students kept as a record of th...

• computer simulations;

• short-answer, paper-and-pencil questions based on the experi...

• multiple-choice items based on the hands-on procedures.

The researchers then examined the extent to which these various surrogates were exchangeable for the
hands-on benchmark tasks. If simpler, less costly methods can ...
Preliminary findings from these investigations suggest the following:

• Laboratory notebooks provide the best surrogate for ...
in lieu of direct observation.

• In the computer simulations, the computer saved all the child's moves ...
by the evaluator. The average time required for grading was about one-tenth of that needed for observing
hands-on investigations, suggesting that computer simulations can offer a big savings in skilled personnel
time.

• Neither the computer simulation nor the paper-and-pencil measures appeared to be adequate substitutes for
the benchmark hands-on procedure. The computer simulation showed considerable variability for individual
students; some individuals appear to do very well on this type ...

• The students enthusiastically participated in the hands-on ...

As investigators throughout the country begin to develop new performance assessments, they will need to
collect data like this in order to evaluate the technical quality of their new measures. As one of the investigators
involved in the above study concludes: ". . . these assessments are delicate ...
piloting to fine tune them."^ Because so many investigators ...
territories, research support will be needed to encourage the collection of test data and the dissemination of results
so that others can learn from data that are innovative, instructive, and yet costly to obtain.

^Ucfaitdl. Slisveltoo,*'Att!he(ntteAMeii^^

In light of these uncertainties about the relative 
costs of testing programs, some school systems are 
striving for improved definitions and better cost 
data. In California, for example: 

The lead consortium is required to develop a
cost-benefit analysis of existing vs. various types of
alternative assessment for consideration by the
California Department of Education and the State
Board of Education. The cost-benefit analysis should
consider payoffs, tradeoffs and advantages or disad-
vantages of alternative vs. existing assessment
practices. The testing costs of alternative assess-
ments, especially the staff development component,
should be considered as a part of overall curriculum
costs. Teachers' renewed motivation and commit-
ment to the Curriculum Frameworks should be
viewed as a major element in the cost-benefit analy-
sis.69

Policy Implications 

In considering the costs of performance assess-
ment, policymakers may wish to adopt a more
inclusive cost-benefit model than has typically been
considered for testing. Benefits in the areas of
curriculum development and teacher enhancement
(staff training) may offset the higher costs associated
with performance assessment. However, little data
has been collected to date; a broader and deeper
analysis will be required before judgments can be
made.



68 Mitchell and Stempel, op. cit., footnote 2, p. 11.

69 California Department of Education, California Assessment Program, "Request for Applications for the Alternative Assessment Pilot Project,"
published document, 1991.

Fairness

There has long been a concern about the effect of
background factors such as prior experience, gender,
culture, and ethnicity on test results. Achievement
tests, for example, need to eliminate the effect of
background factors if they are to measure learning
that has resulted from instruction. A combination of
statistical and intuitive procedures has been devel-
oped for conventional norm-referenced tests to
eliminate or reduce background factors that can 
confound their results. Little is known, however, 
about how background factors may affect scores on 
performance assessments. 

In addition, judgments about fairness will depend 
a great deal on the purposes of the test and the 
interpretations that will be made of the scores. For 
example, on a test that has no significant personal
impact on a student, such as the National Assess-
ment of Educational Progress, it is reasonable to
include problems that require the use of calculators
even though student access to calculators may be
quite inequitable. On the other hand, equitable
access would be an important consideration if the
assessment were one that determined student selec-
tion, teacher promotions, or other high-stakes out-
comes.70

Performance assessments could theoretically lead 
to narrowing the gap in test scores across those who 
have traditionally scored lower on standardized 
multiple-choice achievement tests. By sampling 
more broadly across skill domains and relying less 
heavily on the verbal skills central to existing 
paper-and-pencil tests, proponents hope that these 
differences might be minimized. Performance as- 
sessments, by providing multiple measures, may be 
able to give a better and therefore fairer picture of
student performance.

On the other hand, performance assessments 
could exacerbate existing differences between 
groups of test takers from different backgrounds. 



Some minority group advocates, for example, fear 
that tests are being changed just when students from 
racially diverse backgrounds are beginning to suc- 
ceed on them. They worry that the rules are being 
changed just as those who have been most hurt by 
testing are beginning to learn how to play the game. 

The President of the San Diego City Schools 
Board of Education voiced the apprehensions of the 
minority community: 

We have a long way to go to convince the public
that what we're doing is in the best interests of
children. . . . When we talk about the issue of equity,
the kind of assessments we're talking about require
much more faith in individuals and the belief that
people can actually apply equity in testing. Most of
the time with a normed test you think of something
that has some subjectivity in the development of the
instrument, but then in the final result you know
what the answer is. When you start talking about
some of the assessments we're doing — portfolios — 
it's all subjective.71

Research on the effects of ethnicity, race, and
gender on performance assessment is extremely
limited. Most existing research has explored group
differences on essay test scores only. Moreover,
almost all the subjects in this research were college-
bound students, limiting its generalizability consid-
erably. Results of studies that examine the perform-
ance of women relative to men suggest that women
perform somewhat better on essays than they do on
multiple-choice examinations.72

Studies that report results for different minority
groups are even more scarce. Results are mixed but
tend to suggest that differences on multiple-choice
tests do not disappear when essays are used. For
example, data from NAEP indicate that black/white
differences on essays assessing writing were about
the same size as those observed on primarily
multiple-choice tests of reading comprehension.73
Similarly, adding a performance section to the 
California Bar Examination in 1984 did not reduce 



70 Linn et al., "Complex, Performance-Based Assessment: Expectations and Validation Criteria," Educational

Cal^oiSL'SJSnTCaSS;^^^^ ^ "'^ ^ ^'-P*"' '^^^^

72 Breland and Griswold, Group Comparisons for Basic Skills Measures (New York, NY: College Entrance Examination Board, 1981);
Cooper, op. cit., footnote 61; S.B. Dunbar, "Comparability of Indirect Assessment of Writing Skill as Predictors of Writing Performance Across
^iSffi'',??'"; unpublished manuscript, July 1991; Brent Bridgeman and Charles Lewis, "Predictive Validity of Advanced Placement Essay
and Multiple Choice Examinations," paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL, April
1991; and Traub and MacRury, op. cit., footnote 29.

73 Cited in Linn et al., op. cit., footnote 70.

the difference in passing rates between blacks and
whites. On the contrary, some studies have sug-
gested that ethnic group differences actually in-
crease with essay examinations.74

On the other hand, another study showed that
minority college students in California actually
performed better on tests that were direct measures
of writing ability (the California State University
and Colleges English Placement Test Essay Test or
EPT) than on a multiple-choice test of English usage
and sentence correction (the 50-question, multiple-
choice formatted Test of Standard Written English
or TSWE). In this study, score distributions on the
TSWE and the EPT were similar for white students.
Among African-American, Mexican-American, and
Asian-American students, however, the two tests
generated different score distributions. For these
groups, the TSWE rendered a much more negative
judgment of their English proficiency than the
EPT.75

Policy Implications 

Because of the limited research on the differing 
subgroup performance on new assessment in- 
struments, Congress and other policymakers 
should approach these changes with caution. Data 
on the impacts of performance assessment on 
varying groups is needed in considering extension to 
more high-stakes applications. Careful planning, 
including representatives of groups traditionally 
negatively affected by testing, will be required in 
developing, administering, and scoring performance 
assessments for school accountability, student certi- 
fication, or other selection purposes. 

Role of Teachers and Teacher Training 

In performance assessment, the role of the teacher 
in administering and scoring tests is much greater 
than with multiple-choice tests. Although some 
performance assessments still rely on outsiders to 
conduct the scoring of papers, in the future, class-
room teachers are likely to have greater responsibil-
ity.

Although teachers observe performance all day, 
most have not been involved in defining and 
determining standards of performance common to
those of their colleagues. In Sweden and several
other countries a process called "moderation"
refers to the development of a standardized scoring 
approach among multiple teacher readers. The 
procedure is similar to scoring of the Advanced 
Placement tests and other examinations relying on 
panels of scorers. It requires an intensive effort to 
agree on standards of performance. How does 
excellent work vary from that which is only fair or 
is not acceptable at all? This process is based on a 
shared understanding of curriculum, respect for 
teacher judgment, compromise, shared values, and a 
strong dose of common sense. This may be easier to
manage in those countries where there is a common 
curriculum and a more homogeneous teaching 
population that has been prepared under a central 
system of teacher training institutions. It is not clear 
that this can be adopted in the U.S. system. One 
educator suggested: "If we can trust our teachers to
teach, we should be able to trust them to assess
students."76

Teachers in this country receive little formal
training in assessment. A recent survey found that
fewer than one-third of the States require new
teachers to have demonstrated competence in educa-
tional measurement.77 A survey of the six States in
the Pacific Northwest reported that only Oregon
explicitly requires assessment training for certifica-
tion.78

One reason for the neglect of assessment training
may be the assumption on the part of educators that
the quality of assessments in the classroom is
assured from outside the classroom; that is, most
assessment is "teacher proof," beyond the control
of the teacher.79 Textbooks come with their own



74 Breland and Griswold, op. cit., footnote 72; Dunbar, op. cit., footnote 72; Ina Mullis, "Use of Alternative Assessment in National Assessments:
The American Experience," paper presented at the Office of Educational Research and Instruction conference on the Promise and Peril of Alternative
Assessment, Washington, DC, Oct. 30, 1990.

75 Edward M. White and Leon L. Thomas, "Racial Minorities and Writing Skills Assessment in the California State University and Colleges," College
English, vol. 43, No. 3, March 1981, pp. 276-283.

76 Jack Webber, teacher, Samantha Smith Elementary School, Redmond, WA, personal communication, 1991.

77 "Testing," Education Week, vol. 10, No. 27, Mar. 27, 1991, p. 9.

78 Richard J. Stiggins, "Teacher Training in Assessment: Overcoming the Neglect," Teacher Training in Assessment, vol. 7 in the Buros Nebraska
Symposium in Measurement and Testing, Steven Wise (ed.) (New York, NY: L. Erlbaum Associates, in press).

79 Ibid., p. 6.

worksheets and quizzes, unit tests, and even compu-
terized test items, so teachers feel little responsibil-
ity for developing their own. Yet many of these
text-embedded tests and quizzes are in fact devel- 
oped in the absence of quality control standards. 
Furthermore, the tests that teachers know will be the 
ultimate judge of student proficiency are seen as 
beyond the teacher's responsibility. Finally, the
courses on testing are often seen as irrelevant to the
classroom.80 There is very little treatment of assess-
ment as a teaching tool. Teachers regularly use
assessments to communicate achievement expecta-
tions to students, using assignments both as practice
and as assessments of achievement, involving stu-
dents in self and peer evaluation to take stock of their
own learning with practice tests. This important area
is neglected in teacher training.81

The inservice training situation is not much 
different.82 However, if standard teacher courses in
measurement are irrelevant, there is no reason to try 
to get more teacher candidates or practicing teachers 
to take them. On the other hand, if teachers are 
trained in new curriculum frameworks that have 
been the basis for much of the move to performance 
assessment, the techniques of teaching and assessing 
should be taught as a whole. This is the approach 
being taken in California, Arizona, and Vermont, 
and envisioned for Kentucky. 

Technology can be a means to fast and efficient
delivery of teacher training, as in Kentucky, where
the educational television network provides satellite
downlinks to every school in the State, making it
possible to get the word out to all teachers simultane-
ously. And, if administrators are to understand the
role of assessment in curricular change, and be able
to communicate with the public about school
attainment of intended outcomes, they too need
training in changing methods and goals of classroom
and large-scale assessment.

Policy Implications 

If performance assessment is given a larger 
role in testing programs around the country,
teachers will need to be involved in all aspects:
designing tasks, administering and scoring tests,
and placing test results into context. Teacher
training will need to accompany these efforts. 
Redesigning the tests will not change teaching 
unless teachers are informed and involved in the 
process. The tests themselves could block educa- 
tional progress unless classroom teachers are given 
a larger sense of responsibility for them. 

Research and Development: Sharing 
Experience and Research 

Performance assessment has been spurred primar- 
ily by State Departments of Education as they 
endeavor to develop tests that better reflect their 
particular curricular goals. Yet there are many 
common goals and concerns that have led them to 
come together to share experience with each other. 
In an effort to encourage the development of 
alternative methods of assessment, the U.S. Depart- 
ment of Education has supported the development of 
a State Alternative Assessment Exchange. The goal 
is to create a database of new forms of assessment, 
develop guidelines for evaluating new measures, 
and help prevent States from making costly mis- 
takes. This collaborative effort, led by the De- 
partment's Center for Research on Evaluation, 
Standards, and Student Testing (CRESST) and the
Council of Chief State School Officers, is aimed at 
facilitating development work, not at creating a new 
test. 

The National Science Foundation (NSF) has also 
played an important role in supporting research 
leading to new approaches to assessment in mathe- 
matics and the sciences. NSF supported NAEP in the 
development and pilot testing of hands-on assess- 
ment tasks in mathematics and science. Several of 
these tasks were adopted by the State of New York 
for their hands-on science skills test for fourth 
graders. More recently, NSF has committed $6 
million for 3 years to support projects in alternative
assessment approaches in mathematics and science.

81 Ibid., p. 8.

82 There are some exceptions, however. For example, the Northwest Regional Educational Laboratory has created a video-based training program that
places critical assessment competencies within reach of all teachers and administrators. They have also created "trainer-of-trainer" institutes that will
make it possible for attendees to present to teachers and others a series of workshops on such topics as understanding the meaning and importance of
high-quality classroom assessment; assessing writing proficiency, reading proficiency, and higher order thinking in the classroom; developing sound
grading practices; understanding standardized tests; and designing paper-and-pencil assessments and assessments based on observation and judgment.
Northwest Regional Educational Laboratory, The Northwest Report (Portland, OR: October 1990).

Assessment research remains a small part of the
overall Department of Education research budget. 

Greater effort should be directed toward monitor-
ing the development of performance assessment and
sharing information about models and techniques to
facilitate implementation, prevent duplication of
effort, and foster collaboration.83

Policy Implications 

Because performance assessment is at a devel- 
opmental stage, encouraging States and districts 
to pool experience and resources is an appropri- 
ate policy goal. Expanding research and comparing 
results requires a thoughtful atmosphere and ade-
quate time. Although States are making progress in
redesigning testing to serve educational goals,
pressures for quick implementation of low-cost tests
could present a barrier to this goal. Commitment to
research projects and careful weighing of outcomes 
is essential to an improved testing environment. 

Public Acceptance 

One of the greatest problems with tests is the 
misuse of data derived from them. There is no reason 
to believe this would not also be true with perform- 
ance assessment. 

Because performance assessments aim to provide 
multiple measures of achievement, it may be diffi- 
cult for parents, politicians, and school officials to 
understand their implications. The public has grown
familiar with test results that rank and compare 
students and schools; it may be difficult to appreci- 
ate the information derived from tests that do not 
follow this model. Some attempts are being imple- 
mented to improve public understanding of the goals 
and products of performance assessment, through 
such vehicles as public meetings. But it is not easy. 
The press may be among the most difficult audi- 
ences to educate, since simple measures and statis- 
tics, ranking and ordering, and comparing and listing 
winners and losers make news. Nevertheless, they
may be the most important audience, since so much
of the public's awareness of testing comes from
press reports. 



Policy Implications

Policymakers need to carefully consider the
importance of keeping the public and press
aware of the goals behind changing testing
procedures and formats and the results that
accrue from these tests. If not, there is a strong
likelihood of misunderstanding and impatience that
could affect the ability to proceed with long-term 
goals. 



A Final Note 

Writing assessment is up and running in many 
States. Although careful development is needed and 
issues of bias and fairness need attention, this
technology is now workable for all three major 
testing functions. 

Other methods of performance assessment (e.g., 
portfolios, exhibitions, experiments, and oral inter- 
views) still represent relatively uncharted areas. 
Most educators who have worked with these tech-
niques are optimistic about the potential they offer
for at least two functions: testing in the classroom
for monitoring and diagnosing student progress, and
system monitoring through sampling. However,
much research is needed before performance tasks
can be used for high-stakes applications where
students are selected for programs or opportunities,
certified for competence, and placed in programs
that may affect their educational or economic
futures. Some of this research is now under way for
tests used for professional certification (see ch. 8), 
but much more research support is needed for 
understanding the implications in elementary and 
secondary schooling. Finally, even the most enthusi- 
astic advocates of performance assessment recog- 
nize the importance of policies to guard against 
inappropriate uses. Without safeguards, any form of 
testing can be misused; if this were to happen with 
performance assessment, it could doom a promising 
educational innovation. 



83 Joe B. Hansen and Walter B. Hathaway, "Survey of More Authentic Assessment Practices," paper presented at the National Council for
Measurement in Education/National Association of Test Developers symposium, More Authentic Assessment: Theory and Practice, Chicago, IL, Apr.

CHAPTER 8

Information Technologies and Testing: Past, Present, Future



Highlights 

• Information and data processing technologies have played a critical role in making existing modes of
testing more efficient. The combination of the multiple-choice item format and machine scoring
technologies has made it possible for massive numbers of students to be tested all through their
educational careers.

• By and large, computers and other information technologies have not been applied toward
fundamentally new ways of testing. However, advances in computers, video, and related technologies
could one day revolutionize testing.

• Computer-based testing and computer-adaptive testing can have several advantages over conventional
paper-and-pencil tests. They are quicker to take and score, provide faster feedback, and reduce errors
due to human scoring and administration. Some computerized tests can hone in on students'
achievement levels much more quickly and accurately than conventional tests.

• Cutting-edge technology could push tests well beyond the existing paper-and-pencil formats.
Structuring and presenting complex tasks, tracking student cognitive processes, and providing rapid
feedback to learners and teachers are promising avenues for continued research and development.

• Computerized testing also has drawbacks. It may introduce new types of measurement errors, place
students who lack familiarity with computers at a disadvantage, make it harder for students to skip or
review questions, raise new privacy issues, and create questions of comparability when students take
essentially "personalized" tests.

• Realizing the full potential of new testing technologies will require continued research, and better
coordinated research, in the fields of learning theory, computer science, and test design.



Information and data processing technologies 
have had a powerful influence on educational 
testing. The invention of the multiple-choice item 
format, coupled with advances in machine scoring,
made possible the efficient testing of millions of
children at all stages of their education. But these
efficiency attributes of machine-based scoring and
reporting also raised serious concerns: from the
earliest days of application of these technologies,
critics lamented the loss of richness in detail that had
been a feature of open-ended questions scored by 
human judges, and contended that machine-scored 
tests encouraged memorization of unrelated facts, 
guessing, and other distortions in teaching and 
learning. 

Multiple-choice items and machine scoring of 
tests brought a revolution in student assessment. 
And, not surprisingly, once the technology became 
an entrenched feature of school life, there began a
70-year period of gradual evolution: as information
and data processing technologies became more
powerful and sophisticated, they continued to influ-
ence educational testing, but the applications have
principally improved automation of the basic test
designs initiated at the turn of the century. There has
been relatively little exploration of how the technol-
ogy might open altogether new approaches to
student assessment. Today, however, some experts
believe a new revolution is in the making: they
contend that the increasing power and flexibility of
personal computers, video, and telecommunications
could move testing well beyond what paper-and-
pencil testing can accomplish.

The purpose of this chapter is to examine the state 
of the art of information technologies in testing,
consider policy initiatives that could foster better
uses of current technology, and explore the possibil-
ities for wholly new paradigms of student assess-
ment. The chapter is divided into four sections. The
first provides a brief historical synopsis of technol-
ogy in testing, focusing on the combined effects of
multiple-choice and electromechanical scoring.

The second section is concerned with applications
of computers and video-related technologies to
conventional models of educational assessment. It
addresses issues such as test design and construc-
tion, scoring and analysis of test results, item
banking, computer-adaptive testing, and new video 
and multimedia applications. 

The third section of the chapter describes the gap 
between current and future models of testing, and
explores ways in which computers or other technolo- 
gies could advance the development and implemen- 
tation of new models. 

Finally, the fourth section examines key policy 
issues in developing new models of testing. 

Historical Synopsis 

Multiple choice made its debut in 1915 with the 
Kansas Silent Reading Test, produced by Frederick
Kelly at the State Normal School in Emporia. With
modifications by psychologist Arthur Otis, multiple
choice ". . . soon found its way . . . from reading
tests to intelligence tests," and made possible the
administration of the Army Alpha and Beta tests to
millions of draftees during the First World War.1
Clerks scored each test by hand, using stencils
superimposed on answer sheets. This new method of
testing transformed Alfred Binet's individually ad-
ministered test format (called by some authors the
"methode de luxe"2) into a format amenable to
group administration and the development of group 
norms. According to one chronicle of this technolog- 
ical change: 

... the multiple choice question [was] ... an 
invention ingenious in its simplicity . . . [an] indis- 
pensable vehicle for the dramatic growth of mass 
testing in this country in the span of a few years. It
had not existed before 1914; by 1921 it had spawned
a dozen group intelligence tests and provided close
to two million soldiers and over three million
schoolchildren with a numerical index of their
intelligence; it was also about to transform achieve-
ment testing in the classroom.3



It was the Iowa testing program, under the 
leadership of E.F. Lindquist, that was instrumental 
in turning the twin concepts of group testing and the 
multiple-choice item format into a streamlined 
process for achievement testing of masses of school
children.4 Lindquist took the first hand-scored tests
and designed a scoring key that could be cut into
strips, each strip fitting a test page, with the answers
positioned on the key to match the pupil's responses
on the page. Later, Lindquist pursued his dream of
mechanical, and later electronic, scoring. IBM's
prototype photoelectric machine encouraged Lind-
quist, who built his own analog computer in the
1940s. During the 1950s, he embarked with Profes-
sor Phillip Rulon of Harvard on an effort to design an
electronic scoring machine. Their basic innovation
has since become a staple of the testing industry:

... a specially designed answer sheet would pass
under a row of photo tubes in such a manner that each
photo tube would sense a mark in one of the boxes
on the answer sheet when illuminated by a light
source, and the pulses from this sensing would
trigger a counter cumulating a total raw score for
each test on the answer sheet: the raw score would be
converted to a standard score in a converter unit; the
standard score would be recorded by an output
printer geared to the scoring device.5
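
The scoring sequence described in the passage above (sense the marks, count a raw score, convert it to a standard score) can be sketched in a few lines of Python. This is an illustrative sketch only, not a description of the Iowa machine's actual design: the answer key, the conversion table, and the function names raw_score and to_standard_score are hypothetical and do not come from this report.

```python
from typing import Dict, List, Optional


def raw_score(marks: List[Optional[str]], key: List[str]) -> int:
    """Count items whose sensed mark matches the key (the 'counter' step)."""
    if len(marks) != len(key):
        raise ValueError("answer sheet and key must have the same number of items")
    return sum(1 for mark, correct in zip(marks, key) if mark == correct)


def to_standard_score(raw: int, table: Dict[int, int]) -> int:
    """Look up the standard score for a raw score (the 'converter unit' step)."""
    return table[raw]


# Hypothetical five-item test: the key and conversion table are invented
# for illustration and are not taken from any real test.
KEY = ["B", "D", "A", "C", "B"]
CONVERSION = {0: 40, 1: 44, 2: 48, 3: 52, 4: 56, 5: 60}

# None stands for an item where no mark was sensed.
sheet = ["B", "D", "C", None, "B"]
raw = raw_score(sheet, KEY)
print(raw, to_standard_score(raw, CONVERSION))
```

A lookup table is used for the conversion step because the passage describes a separate converter unit rather than a formula; any monotonic raw-to-standard mapping could be substituted.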

The first "Iowa machine" went into production in
1955, and cost close to $200,000 (nearly three times
more than planned).6 Continuing refinements
through 1957 led Lindquist to boast that the machine
was living up to virtually all expectations. It could
now, in a single reading of an answer sheet, obtain
up to 14 separate raw scores; convert these into 20 
different standard scores, percentile ranks, or con- 
verted totals of the converted scores; obtain simulta- 
neously as many totals and/or subtotals as the 
desired combinations of counters would permit; 
print and punch scores simultaneously; print or 
punch both names and scores simultaneously; and 



1 Franz Samelson, "Was Early Mental Testing (a) Racist Inspired, (b) Objective Science, (c) A Technology for Democracy, (d) The Origin of Multiple
Choice Exams, (e) None of the Above? Mark the RIGHT Answer," Psychological Testing and American Society, 1890-1930, M. Sokal (ed.) (New
Brunswick, NJ: Rutgers University Press, 1987), pp. 113-127. See also ch. 3 of this report for discussion, and ch. 1 for a reproduction of the cover of
the 1915 Kansas test.

2 Rudolf Pintner, cited in Samelson, op. cit., footnote 1, p. 116.

3 Samelson, op. cit., footnote 1.

4 For a comprehensive discussion of the history of the Iowa program, see Julia J. Peterson, The Iowa Testing Program (Iowa City, IA: University
of Iowa Press, 1983). For discussion of the principal roles of Lewis Terman, Edward Thorndike, Robert Yerkes, and others in the birth of the
group-administered intelligence and achievement testing movement, see, e.g., Paul Chapman, Schools as Sorters: Lewis M. Terman, Applied Psychology,
and the Intelligence Testing Movement, 1890-1930 (New York, NY: New York University Press, 1988); also see ch. 3 of this report.

5 Peterson, op. cit., footnote 4, p. 91.

6 Ibid., p. 89.

do a number of "interesting tricks" it was not
originally intended to do.

A new era of testing in American schools had 
dawned. Here is how one test publisher, whose 
experiences date from the earliest days of this new 
era, describes the transition: 

. . . [before machine scoring] most standardized tests 
were hand-scored by the teachers. . . . Under that 
system, tests corrected and scored by the teacher 
provided opportunity for careful pupil analysis by 
the teachers. In turn that analysis, pupil by pupil and 
class by class, provided meaningful measures for 
individualizing pupil instruction, improving instruc- 
tion, reassessing the curriculum, and making appro- 
priate textbook selections. Furthermore, and by no 
means should this be overlooked, it gave the teacher 
support beyond his or her undocumented human 
judgment of pupils that by no means goes unchal- 
lenged by many parents and, for that matter, pupils. 
As the machine-scoring movement grew, the activi- 
ties related to testing changed. Certainly, the scoring 
activity left the classroom and often as not the school 
system itself. Test results moved increasingly into 
the hands of the administrative staff. Test specialists 
were employed who were interested in an ever 
broader array of derived scores to be used for many 
purposes ... the hands-on dimension for teachers 
receded and in due course disappeared almost en- 
tirely.^ 

Current Applications of Computers 
in Testing^ 

Design and Construction of Tests 

Item Writing 

Computers have many capabilities that can aid 
test publishers in the efficient design and construc- 
tion of standardized tests. In addition, basic word 
processing, graphics, and spreadsheet programs 
make it possible for State and district school 
personnel, as well as individual teachers, to create 
their own items or to edit items developed by others. 
Editing the text of test items, selecting specific items 
from a collection stored in memory, and sequencing 
the test items are all substantially easier with basic 
desktop computers and generic tool software. 

Increasingly, however, dedicated item writing and 
test construction packages have become available. 
These go beyond the capacity of generic word 
processing software and are intended specifically for 
writing tests. For example, they can contain item 
templates and special notations such as mathemati- 
cal symbols not usually available with commercial 
word processing software. Once the test is created on 
the computer, it can then be printed out, reproduced, 
and administered to students who fill in the re- 
sponses in the traditional paper-and-pencil format. 

Using computers to construct items is not a new 
concept. Researchers in the 1960s had attempted to 
develop software to facilitate the construction of 
sentence completion and spelling items, but the 
software was not adopted by test constructors.^ This 
is explained in part by the feeling among some 
experts that item writing for educational and psycho- 
logical testing is more art than science, and that 
computer technology routinizes what ought to be a 
more fluid and creative process. Most item-writing 
efforts for standardized achievement tests involve an 
interplay between content specialists (teachers in the 
content areas) and psychometric experts who iden- 
tify item-writing flaws and examine the match 
between items and objectives of the test.^^ 

Item Banking 

Increases in computer memory capacity have 
made "item banks" an important enhancement in 
test construction. Large collections of test items are 
organized, classified, and stored by their content 
and/or their statistical properties, allowing test 
developers or teachers to create customized tests. 
Item banks in use today consist almost exclusively 
of multiple-choice or true-false questions, although 
there is some research under way on the use of 
CD-ROM technology to store longer open-ended 
items.^^ 



7 Harold Miller, former Chairman of the Board, Houghton Mifflin Co., Inc., personal communication, Dec. 14, 1990. 

8 This section draws on C.V. Bunderson, J.B. Olsen, and A. Greenberg, "Computers in Educational Assessment," OTA contractor report, December 
1990. 

9 Tse-chi Hsu and Shula F. Sadock, Computer Assisted Test Construction: The State of the Art (Washington, DC: ERIC Clearinghouse on Tests, 
Measurement, and Evaluation, American Institutes for Research, November 1985), p. 5. 

10 Gale H. Roid, "Item Writing and Item Banking by Microcomputer: An Update," Educational Measurement: Issues and Practice, vol. 8, No. 3, fall 
1989, p. 18. 

11 See, e.g., Judah Schwartz and Katherine A. Viator (eds.), The Price of Secrecy: The Social, Intellectual, and Psychological Costs of Current 
Assessment Practice: A Report to the Ford Foundation (Cambridge, MA: Harvard Graduate School of Education, September 1990). 

A variant on the item-bank concept is one in 
which testing objectives are stored in the form of 
algorithms that can be used to create individual test 
items. The algorithm draws on stored data to 
produce a vast number of variations on an objective. 
Instructors choose the objective and specify the 
number of different problems, and the computer 
provides the appropriate test items (see figure 8-1). 
One item bank currently on the market covers 
mathematics objectives, from basic mathematics 
through calculus.^^ If the teacher wishes to test a 
student on adding two two-digit numbers, the 
objective is represented as A + B, where A and B are 
whole numbers greater than 9 and less than 100. The 
computer would then insert random numbers for A 
and B, so that literally thousands of different items 
sharing a similar measurement function can be 
produced. The system can be customized to meet the 
objectives of States, districts, or even specific 
textbook or curriculum objectives. 
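
The A + B objective described above can be sketched in a few lines of Python. This is an illustration only, not the algorithm of any commercial product; the function name make_addition_items and the layout of each item record are hypothetical choices made for the example.

```python
import random

def make_addition_items(count, seed=None):
    """Generate `count` distinct items for the objective A + B (two-digit addends)."""
    rng = random.Random(seed)
    # 90 * 90 = 8,100 ordered (A, B) pairs exist, so cap the request there.
    count = min(count, 90 * 90)
    pairs = set()
    while len(pairs) < count:
        a = rng.randint(10, 99)  # whole number greater than 9 and less than 100
        b = rng.randint(10, 99)
        pairs.add((a, b))
    return [
        {"a": a, "b": b, "stem": f"{a} + {b} = ?", "answer": a + b}
        for a, b in sorted(pairs)
    ]
```

Every item produced this way shares the same measurement function (adding two two-digit numbers) while differing in its surface numbers, which is the property the text attributes to algorithm-based item banks.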

Constructing standardized tests to meet the elabo- 
rate and detailed test specifications of school dis- 
tricts and States is a complex and time-consuming 
task. Computers can help speed and streamline this 
task by selecting test questions for use in a test form 
to match detailed statistical and content specifica- 
tions. After the computer selects test questions for 
the first draft of a test form, these items can be 
reviewed by test development staff, and possibly 
field tested.^^ Computing power greatly speeds up 
this process and makes it possible for States and 
local education authorities to create their own 
standardized tests as well as varying forms of the 
same test for multiple administrations. 
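
A minimal sketch of computer-assisted form assembly follows, assuming a hypothetical item bank in which each item carries an identifier, a content area, and a difficulty statistic. It uses a simple greedy rule (take the items closest to a target difficulty within each content area); operational systems may rely on more formal optimization methods, so this should be read as a simplification rather than as any publisher's procedure.

```python
def assemble_form(bank, blueprint, target_difficulty):
    """Draft a test form that satisfies a content blueprint.

    bank: list of dicts with hypothetical keys "id", "content", "difficulty"
    blueprint: dict mapping content area -> number of items required
    """
    form = []
    for content, needed in blueprint.items():
        candidates = [item for item in bank if item["content"] == content]
        if len(candidates) < needed:
            raise ValueError(f"bank has too few items for {content!r}")
        # Closest to the target difficulty first; ties broken by id.
        candidates.sort(
            key=lambda item: (abs(item["difficulty"] - target_difficulty), item["id"])
        )
        form.extend(candidates[:needed])
    return form
```

The draft form returned here corresponds to the "first draft" in the text: it would still be reviewed by test development staff before use.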

Among the many applications of the item-bank 
concept, a large-scale effort begun in West Virginia 
in 1988 offers some useful lessons. As part of a 
larger effort to restructure financing in the State and 
to assess learning outcomes for students, the State 
purchased 1,200 copies of the testing software, one 
for every school in the State. Reflecting a bottom-up 
strategy, the system allows teachers to select items, 
construct their own tests, print them out, copy them, 
and administer them in the traditional paper-and- 
pencil format. Score results can be analyzed and 
student progress tracked through the use of instruc- 
tional management software. A pilot test of the 
system highlighted the fact that teachers needed 
training on how to use the hardware and software 
and that the existing infrastructure of computers for 
teachers was inadequate. Among the benefits noted 
were the ease in generating tests for many uses and 
the advantages of relieving teachers of some of the 
"busy work" of test construction and administra- 
tion. 

Figure 8-1—Three Questions Created by 
One Algorithm 

1. What fraction of this figure is shaded? 

A. 5/7 B. 5/12 
C. 7/12 D. 5 

2. What fraction of this figure is shaded? 

A. 3/10 B. 3 
C. 3/7 D. 7/10 

3. What fraction of this figure is shaded? 

A. 2/3 B. 3 
C. 1/3 D. 1/2 

SOURCE: ips Publishing, Exam in a Can (brochure) (Westlake Village, 
CA: 1990). 

The West Virginia system deals with traditional 
subject areas. Note, however, that in its request for 
proposals for a computer system, the State sought a 
system capable of storing item types other than 
multiple choice and true-false, with software avail- 
able in both IBM and Apple formats. 



12 ips Publishing, Exam in a Can (computer software) (Westlake Village, CA: 1990). 

13 Mark D. Reckase, director, Development Division, Assessment Innovations, American College Testing Program, personal communication, 
September 1991. See also Dato N.M. de Gruijter, "Test Construction by Means of Linear Programming," Applied Psychological Measurement, vol. 
14, No. 2, 1990, pp. 175-182; and Ellen Boekkooi-Timminga, "The Construction of Parallel Tests From IRT-Based Item Banks," Journal of Educational 
Statistics, vol. 15, No. 2, 1990, pp. 129-145. 

14 John A. Willis, "Learning Outcome Testing Program: Standardized Classroom Testing in West Virginia Through Item Banking, Test Generation, 
and Curricular Management Software," Educational Measurement: Issues and Practice, vol. 9, No. 2, summer 1990, pp. 11-14. 

Scoring, Reporting, and Analyzing 
Test Results 

Computers are now vital to large-scale testing 
programs. They allow for fast and efficient scanning 
and processing of answer sheets, computation of 
individual and group scores and subscores, and 
storage of score data for later analysis. Item analysis 
and item-response theory statistics can be calculated 
across large numbers of test takers, and the item and 
test statistic files can be automatically updated using 
only a few simple commands. Archival copies of test 
scores can also be easily made. Computers provide 
a wide range of individual and group reports that can 
be printed from the resulting test scores and profiles. 
Computerized interpretative reports are also pre- 
pared for an increasing number of educational and 
psychological tests. 
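
As a rough illustration of the scoring and item analysis described above, the sketch below computes each examinee's raw score and each item's proportion correct (a basic item-difficulty statistic) from scanned answer strings. The function name and the input format (one string of responses per answer sheet) are assumptions made for the example, not a description of any scoring service's software.

```python
def score_answer_sheets(sheets, key):
    """Return (raw score per examinee, proportion correct per item).

    sheets: list of response strings, one character per item (hypothetical format)
    key: string of correct responses, same length as each sheet
    """
    if not sheets:
        raise ValueError("no answer sheets supplied")
    if any(len(sheet) != len(key) for sheet in sheets):
        raise ValueError("every sheet must have one response per keyed item")
    # Raw score: number of responses that match the key.
    scores = [sum(r == k for r, k in zip(sheet, key)) for sheet in sheets]
    # Item statistic: share of examinees answering each item correctly.
    p_values = [
        sum(sheet[i] == key[i] for sheet in sheets) / len(sheets)
        for i in range(len(key))
    ]
    return scores, p_values
```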

Large mainframes or computers are used to 
process and analyze test data and to prepare printed 
reports for individual students or groups of students. 
These mainframes and computers are typically 
located at centralized test development, publication, 
and scoring service centers run by test publishers. 

Taking Tests on the Computer 

In addition to their role as workhorses to aid in test 
construction, recordkeeping, and analysis and re- 
porting of results, computers can also be the medium 
on which tests are administered. This report defines 
computer-based testing (CBT) as applications in 
which students respond to questions directly on the 
computer, via keyboard, keypad, mouse, or other 
data-entry device. Test booklets, fill-in-the-bubble 
answer sheets, and other traditional paper-and- 
pencil testing techniques are not used.^^ 

Classroom Testing With Networks 
and Integrated Learning Systems 

Much of the available computer software de- 
signed for instruction includes questions throughout 
the program designed to check on a student's 
understanding of the material. Responses can be 
printed out for the teacher to gauge student progress 
and identify problem areas. Many schools have 
linked the computers they have in laboratories and 
classrooms; networks generally consist of 15 to 25 
computers linked through a central file server. With 
these local area networks (LANs), the same software 
can be shared among many computers, easing the 
logistics of administration for the teacher. Through 
computers connected by a networked system, pro- 
grams and data can be shared and then sent to 
common peripheral devices such as a printer, hard 
disk, or videodisc. Each computer on the LAN can 
operate independently, using different pieces of 
software for each student, or share software among 
several or all students, enhancing the teacher's 
ability to manage and individualize instruction and 
testing for each child.^^ 

Photo credit: Courtesy of National Computer Systems, Inc. 

Using machines like the National Computer Systems' 
Opscan 21, 10,000 tests can be scored in 1 hour. 

One of the greatest selling points of networks is 
the added tracking and reporting capabilities that 
become possible when all student data are stored on 
a single storage device such as a hard disk. Stand- 
alone computers with individual floppy disks do not 
have sufficient storage capacity for all of the student 
records in a class or school. In contrast, networked 
systems make it possible to collect extended reports 
on student progress. In large part because of the 
appeal of these assessment features, the number of 
districts with network installations has grown stead- 
ily over the past 3 years, from just over 1,500 in 
1988-89 to over 2,800 in 1990-91.^^ 



15 Paper and pencils may be used as backup tools, such as scratchpads or worksheets, but they are not the form of entry of final answers to test questions. 

16 For further discussion of how school computers can be networked, see, e.g., U.S. Congress, Office of Technology Assessment, Power On! New Tools 
for Teaching and Learning, OTA-SET-379 (Washington, DC: U.S. Government Printing Office, September 1988). 

17 Quality Education Data, "Technology in Schools: 1990-91 School Year," Market Intelligence (Denver, CO: 1991), p. T-7. 

Computers are a key feature at the Saturn School of 
Tomorrow. A Mac Lab is available at all times for 
students to do word processing and publishing. 

Integrated learning systems (ILSs) are LANs with 
a comprehensive instructional management system. 
Courseware is typically published and sold by the 
ILS vendor, and spans part or all of a curriculum 
(e.g., K-6 language arts). It is possible to add 
additional software in some ILSs. As in other 
networked systems, instruction is controlled and 
managed through the central computer, which may 
be connected to printers, modems, videodiscs, or 
other peripheral devices. 

Because of their close linkages between instruc- 
tion and testing, both of which can be matched to 
district curricula, ILSs have become increasingly 
popular. Although fewer schools have ILSs than 
networks, their number has been growing rapidly 
(from about 3,300 in 1989-90 to almost 7,000 in 
1990-91).^^ The vast majority of ILS use is at the 
elementary level, with more than 80 percent of ILS 
usage in reading/language arts and mathematics.^^ 

With an ILS, testing is an integral part of instruc- 
tion. The testing part of the system highlights what 
to teach, and the instructional part is designed for 
easy assessment of student performance. Some 
critics fear this focus on test-based skills reinforces 
a linear and limited approach to learning. Others, 
however, suggest it could help bridge assessment 
and instruction. The importance of networks/ILSs is 
heightened by the fact that continued demand for 
these technologies could create opportunities for 
testing-software developers to collaborate with sup- 
pliers of these products. 

At the Saturn School of Tomorrow, students work 
independently on integrated learning systems. 

ILS vendors include Computer Curriculum Corp., 
Education Systems Corp., ICON, PLATO, Wasatch, 
WICAT, and the Jostens Learning Corp. For exam- 
ple, Jostens' Instructional Management System is 
intended to allow teachers to deliver a customized 
sequence of lessons to each student; direct and 
monitor student progress; adopt the sequence of the 
embedded curricula and prescribe lessons from 
third-party materials; branch students to appropriate 
remedial or enrichment activities; generate criterion- 
referenced pre- and post-tests; create, maintain, and 
update instructional records on each student; and 
electronically transfer records within and between 
schools. 

Although networks and ILSs offer a promising 
way to bring computerized testing into the schools, 
their focus is primarily on classroom instruction. 
The growth in the installed base of networks and 
ILSs in schools suggests the potential for their 
expanded application in testing. It is important to 
note that these centralized systems place software 
and test items under the control of one person 
(usually the teacher). 



«Ibid.,p.T^. 

19 Charles L. Blaschke, "Integrated Learning Systems/Instructional Networks: Current Uses and Trends," Educational Technology, vol. 30, No. 11, 
November 1990, p. 21. 

Computers and Testing: Beyond the 
Classroom 

Computer-based testing is not commonly used for 
system monitoring or student selection, placement, 
and certification in elementary and secondary 
schools. Few schools have enough computers to 
implement a large-scale testing program via com- 
puter.^ Even where adequate hardware exists, the 
demand for computerized standardized tests has, in 
the past, been low. Today's standardized paper-and- 
pencil tests are a well-entrenched technology and 
practice. Students, teachers, and the public are 
familiar with test books and "bubble" answer sheets 
and the technology is easy to use, score, and 
administer. There is also a well-developed and 
longstanding support system underpinning this type 
of testing. 

In its most basic form, CBT takes existing 
paper-and-pencil tests and administers them on a 
computer. Items, format, and procedures remain 
the same as paper-and-pencil, and the computer's 
role is that of an "automated answer sheet."^^ 
Computers offer capabilities that make even these 
limited applications more flexible, powerful, and 
efficient. 

Tests other than those of academic achievement 
have also become the subject of research in CBT. 
Examples are various psychological tests and tests 
used for admissions, placement, and certification at 
the postsecondary level. The Educational Testing 
Service (ETS) has been pilot testing computer-based 
versions of the Graduate Record Examination (GRE); 
both ETS and the American College Testing Pro- 
gram (ACT) have developed computer testing pack- 
ages for college placement testing and are currently 
conducting research to verify comparability of 
scores from the computerized and paper-and-pencil 
tests. Finally, there is growing interest in the use of 
computerized tests for professional certification, in 
the military, and in industry for selection and 
placement purposes. 



To date, research on comparability between 
computer-based and conventional paper-and-pencil 
tests has had mixed results. Most studies have found 
that students score slightly, but not significantly, 
higher on paper-and-pencil tests than on computer- 
based tests. Although it was hypothesized that 
computer inexperience and computer anxiety might 
exacerbate score differences between testing mod- 
els, this has not been found to be significant. It has 
been suggested, however, that earlier forms of CBT, 
which did not allow examinees to skip items and go 
back and answer them later in the test, or to review 
and change responses of items already answered, 
may have accounted for lower scores on computer- 
based tests.^^ Because of this concern, the American 
Psychological Association Guidelines recommends 
that test publishers perform separate equating and/or 
norming studies when computer-based versions of 
standardized tests are introduced.^^ It should be 
noted that current forms of CBT usually allow 
students to skip items, return to them later, and 
change their answers just as they would in a 
paper-and-pencil test. 

Computerized Adaptive Testing 

An innovation in testing that applies the com- 
puter's rapid processing capability to an advanced 
statistical model is called "computerized adaptive 
testing" or CAT. In conventional testing all exam- 
inees receive the same set of questions, usually in the 
same order. But with CAT the computer chooses 
items to administer to a given examinee based on 
that examinee's responses to previous test items. 
Thus, not all examinees receive the same set of test 
items.^ 

The advent of "item-response theory" in the 
1960s led to the realization that relative performance 
of students could be assessed more efficiently if test 
items were selected and sequenced with specific 
reference to individual student ability. Instead of 
presenting a broad range of items to all students, 
some of which are too difficult and some too easy, 
item-response theory allows the range of difficulty 



20 James B. Olsen, Apryl Cox, Charles Price, Mike Strozeski, and Idolina Vela, "Development, Implementation, and Validation of a Computerized 
Test for Statewide Assessment," Educational Measurement: Issues and Practice, vol. 9, No. 2, summer 1990. 

21 Isaac I. Bejar, "Speculations on the Future of Test Design," Test Design: Developments in Psychology and Psychometrics, S.E. Embretson (ed.) 
(Orlando, FL: Academic Press, 1985), p. 280. 

22 Steven L. Wise and Barbara S. Plake, "Research on the Effects of Administering Tests Via Computers," Educational Measurement: Issues and 
Practice, vol. 8, No. 3, fall 1989, p. 7. 

23 Ibid. 

24 Ibid., p. 5. 

of items to be determined by the test-taker's 
responses to previous items: 

Adaptive testing . . . seeks to present only items 
that are appropriate for the test taker's estimated 
level of skill or ability. Questions that are too easy or 
too difficult for the candidate contribute very little 
information about that person's ability. More specif- 
ically, each person's first item on an adaptive test 
generally has about medium difficulty for the total 
population. Those who answer correctly get a harder 
item; those who answer incorrectly get an easier 
item. After each response, the examinee's ability is 
estimated, along with an indication of the accuracy 
of the estimate. The next item to be posed is one that 
will be especially informative for a person of the 
estimated ability, which generally means that harder 
questions are posed after correct answers and easier 
questions after incorrect answers. The change in item 
difficulty from step to step is usually large early in 
the sequence, but becomes smaller as more is learned 
about the candidate's ability. The process con- 
tinues until there is enough information to place the 
person on the ability scale with a specified level of 
accuracy, or until some more pragmatic criterion is 
achieved.^ 
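
The loop described in this passage can be sketched as follows. This is a deliberately simplified illustration: operational adaptive tests estimate ability with item-response theory statistics, whereas this sketch moves the estimate up or down by a step that is halved after each item, and treats the shrinking step as a crude stand-in for the accuracy of the estimate. The function name, the bank format, and the default step values are all assumptions made for the example.

```python
def adaptive_test(bank, answer_fn, start=0.0, first_step=1.0, min_step=0.1):
    """Administer items until the step size falls below min_step.

    bank: dict mapping item id -> difficulty on an arbitrary ability scale
    answer_fn: callable(item_id) -> True if the examinee answers correctly
    """
    ability, step = start, first_step
    remaining = dict(bank)
    administered = []
    while remaining and step >= min_step:
        # Treat the unused item closest in difficulty to the estimate as most informative.
        item = min(remaining, key=lambda i: (abs(remaining[i] - ability), i))
        del remaining[item]
        correct = answer_fn(item)
        administered.append((item, correct))
        # Harder items follow correct answers, easier items follow incorrect ones.
        ability += step if correct else -step
        step /= 2  # changes are large early and shrink as more is learned
    return ability, administered
```

With a bank of 17 items spaced evenly in difficulty and an examinee who answers correctly whenever an item's difficulty is at or below 1.2, this sketch stops after four items, which illustrates why adaptive tests can be much shorter than fixed forms.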

The concept of adaptive testing is not new; most 
individually administered tests have some adaptive 
features, and in some group testing in a paper-and- 
pencil format there may be a form of pretest to 
determine student ability and to narrow the range of 
items presented on the main test. However, the 
enormous superiority of the computer in terms of 
storage capacity and processing speed has made 
adaptive testing much more efficient. 

Computerized adaptive tests can be used for 
instructional feedback, system monitoring, or selec- 
tion, placement, and certification functions. One 
example is the College Board Computerized Place- 
ment Tests, developed jointly by the College En- 
trance Examination Board and ETS, for use by 
2- and 4-year colleges to assess the readiness of 
entering students for college-level work in English, 
reading, and mathematics, and to determine their 
need for additional preparatory courses. These tests 
have been used since the mid-1980s at approxi- 
mately 80 colleges across the United States.^ 

The Portland (Oregon) school district has devel- 
oped a CAT system linked to its districtwide testing 
program. The Portland Achievement Level Testing 
(PALT) program, a combined norm-referenced and 
criterion-referenced test battery developed by the 
district, has been the district's principal evaluation 
instrument since 1977. It has been expanded and 
refined regularly to keep up with changes in 
curricula and instructional priorities. All students in 
grades three to eight take the PALT paper-and- 
pencil tests in reading and mathematics twice yearly; 
eighth graders are expected to meet the district's 
minimum competency levels, and if they fail they 
must repeat the test periodically through high school 
in order to graduate with a standard diploma. 
Roughly 40,000 students (out of a total K-12 
enrollment of 55,000) are tested twice yearly. 

The CAT version of the test, known as Computer- 
ized Adaptive Reporting and Testing (CARAT), was 
initially developed over the 5-year period 1984 to 
1989 with annual support from the Portland School 
Board of $250,000 or more. It is expected to be 
implemented districtwide by 1992 under a 3-year $1 
million grant from the school board. It is available 
for students to work on any time during the year. 

CARAT consists of items drawn from the PALT 
item banks. CARAT tests can count for placement in 
special programs (talented and gifted, or Chapter 1). 
However, at present students must take the paper-and- 
pencil test or its electronic equivalent (not the 
adaptive version) in order to be certified for gradua- 
tion. 

CARAT began on a pilot basis in six schools in 
1985-86, and has since been implemented in all 
Chapter 1 schools in the district. Computer adaptive 
tests have been used for more than 5,000 students for 
Chapter 1 evaluation and for assessing competency 
in mathematics and reading, grades three through 
eight, since the program was begun. 

District officials hope to have CARAT installed in 
every school by the 1992-93 school year, and 
eventually to shift the entire testing program to 
CARAT. They believe that CARAT: 

• makes it possible to test students as soon as they 
enter the district, in order to place them in 
appropriate instructional programs; 



25 Bert F. Green, R. Darrell Bock, Lloyd G. Humphreys, Robert L. Linn, and Mark D. Reckase, "Technical Guidelines for Assessing Computerized 
Adaptive Tests," Journal of Educational Measurement, vol. 21, No. 4, winter 1984, pp. 347-348. 

26 Bunderson et al., op. cit., footnote 8, p. 22. 

• makes possible more continuous assessment of 
student progress during the school year than 
would be possible from the fall and spring 
testing alone; 

• is available at all times, providing access to 
students alone or in groups at any time and at 
any site; 

• provides ready access to longitudinal test data 
on any designated group of students in the 
school; 

• allows for the shortest possible tests (a CARAT 
test takes about 20 minutes) with known 
measurement properties; and 

• offers enhanced test security, since students 
rarely get the same questions and since test 
questions can be changed regularly.^^ 

The Northwest Evaluation Association has mar- 
keted the Portland adaptive testing system, includ- 
ing the item banks and computerized software, to 
other districts in Oregon, at a cost of approximately 
$16,000. Currently about 15 districts, including 
some other large systems, use PALT-based paper-and- 
pencil tests and CAT. 

Computerized Mastery Testing 

One application of CAT, known as computerized 
mastery tests, includes cut scores (the decision point 
separating masters from nonmasters) to assess whether 
the test taker has achieved "mastery" in a field.^ 
Students pass or fail the test depending on how many 
items they answer correctly. If the responses do not 
provide a clear enough picture, additional items of 
similar difficulty are presented until mastery is 
determined. These tests typically require only one- 
half of the questions administered in the conven- 
tional paper-and-pencil format to reach the same 
reliability levels. Reliability is high around the cut 
score. As in the case of Portland, computerized 
mastery testing can be used for minimum compe- 
tency testing. 
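
The idea of presenting additional items until the pass/fail picture is clear can be sketched with a simple decision band around the cut score. This is an assumption-laden illustration: the cut score, margin, and item limits below are hypothetical values, and operational mastery tests rest on statistical decision rules rather than the plain proportion-correct band used here.

```python
def mastery_decision(responses, cut=0.7, margin=0.15, min_items=5, max_items=30):
    """Return (decision, items_used) from a sequence of True/False responses.

    A decision is made as soon as the proportion correct is clearly above or
    clearly below the cut score; otherwise more items are consumed.
    """
    correct = 0
    used = 0
    for used, is_correct in enumerate(responses, start=1):
        correct += bool(is_correct)
        proportion = correct / used
        if used >= min_items:
            if proportion >= cut + margin:
                return "master", used
            if proportion <= cut - margin:
                return "nonmaster", used
        if used >= max_items:
            break
    if used == 0:
        raise ValueError("no responses supplied")
    # Still unclear after all available items: fall back to the cut score itself.
    return ("master" if correct / used >= cut else "nonmaster"), used
```

An examinee far from the cut score is classified after the minimum number of items, while a borderline examinee receives more items, which matches the observation that measurement effort is concentrated around the cut score.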

Occupational competency testing has also been a 
target of new technological applications. Although 
assessments such as the one designed for the 
National Board of Medical Examiners (see box 8-A) 
serve quite different functions than tests in the 
elementary and secondary school years, they offer 
some important lessons for the capability of comput- 
ers and simulation software. (See also below, under 
"New Models of Assessment and the Role of 
Technology.") 

Taking Tests on the Computer: Pros and Cons 

Computer-based testing can improve the effi- 
ciency of standardized test administration and pro- 
vide administrative benefits when compared to 
standardized paper-and-pencil testing. But like any 
new technology, benefits need to be weighed against 
potential drawbacks. 

Advantages of CBT 

Because questions are presented together with the 
response format (as opposed to a separate answer 
sheet), it is faster to take a computer-administered 
test. One study showed that CBTs and CATs are 
between 25 and 75 percent faster than paper-and- 
pencil tests in producing otherwise comparable 
results (see figure 8-2).^ 

A greater variety of questions can be included in 
the test-builder's tool kit.^^ Constructed response 
items and short answers involving words, phrases, or 
procedures can also be scored relatively easily by 
matching them to the correct answer (or answers) 
stored in the computer. Voice synthesizers can be 
used for spelling or foreign language examinations. 
Computer graphics and video can make possible 
other novel item types or simulations. 
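
Scoring a short constructed response by matching it against stored answers might look like the sketch below. The normalization rule (ignoring case and extra whitespace) is an assumption chosen for the example; a real scoring system would need rules suited to the subject matter, such as tolerances for spelling or numeric format.

```python
def score_short_answer(response, accepted_answers):
    """Return True if the normalized response matches any stored correct answer."""
    def normalize(text):
        # Ignore case, surrounding whitespace, and runs of internal whitespace.
        return " ".join(text.lower().split())

    accepted = {normalize(answer) for answer in accepted_answers}
    return normalize(response) in accepted
```

For example, with the stored answers "Carbon dioxide" and "CO2", the response "carbon   dioxide" is scored as correct and "oxygen" as incorrect.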

Computers allow new possibilities for items that 
require visualization of motion or complex interde- 
pendencies. For example, a conventional physics 
examination might require long and complex syntax 
or a series of static diagrams to depict motion. On a 
computerized test, motion can be more simply and 
clearly depicted using either a high-resolution graphic 
or video display. A computerized version of the item 
gives a purer measure of the examinee^s understand- 



27 District officials note, however, that Computer Adaptive Reporting and Testing test items can appear on the paper-and-pencil version of the test 
that counts. The extent of overlap, which could affect test validity, has not been measured. 

28 David J. Weiss and G. Gage Kingsbury, "Applications of Computerized Adaptive Testing to Educational Problems," Journal of Educational 
Measurement, vol. 21, winter 1984, pp. 361-375. 

29 James B. Olsen, "The Four Generations of Computerized Testing: Toward Increased Use of AI and Expert Systems," Educational Technology, 
vol. 30, No. 3, March 1990, p. 37. 

30 Howard Wainer, "On Item Response Theory and Computerized Adaptive Tests," The Journal of College Admissions, 1983. 

Box 8-A—Certification Via Computer Simulations: 
The National Board of Medical Examiners 

A 65-year-old man arrives at the Emergency Department of a ir^jor teaching hospital, complaining of 
respiratory distress and sharp chest pains. He appears to be in acute distress, moaning and holding his hands 
over the left side of his chest The emergency medical technician who brought the patient in says he has a 
history of asthma and emphysema. You are a medical student, and must diagnose and treat the patient The 
entire spectrum of modern medicine is at your fingertips, but time is of the essence in this potentially 
life-threatening condition of respiratory or cardiovascular distress. What do you do?^ 

This is an example of 1 of 25 patient simulaticms in a Computer Based Exam (CBX) that has, since 1988, been 
used at 75 medical schools in the United States and Canada. The ultimate objective for these simulations is use in 
the certification examinaticm of the National Board of Medical Examiners (NBME), required of physicians in 
training before they can become licensed. 

Medical schools have long been concerned that the examinations used to test students are heavy on the recall 
of factual information, but may not adequately test other important indicators of a candidate's readiness to practice 
medicine. One of these characteristics is the ability to employ the skills needed in clinical care (evaluating patient 
symptoms, conducting the appropriate procedures, ordering and evaluating tests, bringing in other experts for 
consultation) in order to accurately and quickly diagnose patient problems and diseases. In the NBME's CBX, the 
examinee is provided a simulated clinical environment in which cases are presented for actual patient management. 
Through a blank entry screen that automatically processes free-text orders, the examinee can request more than 8,500 
terms representing over 2,000 diagnostic studies, procedures, medications, and consultants, and can move the 
patient among the available health care facilities. As the examinee proceeds, the computer records the timing of all 
actions taken. These actions are compared with a codified description of optimal management based on the 
judgments of expert doctors, and scoring is based on how well the examinee follows appropriate practice. 
An examinee's management of the case presented above might proceed as follows (see figure 8-A1): 
The results suggest a diagnosis of spontaneous pneumothorax (a collapsed lung), a possibly life-threatening 
disease process. The patient's low blood pressure suggests some degree of cardiovascular difficulty, indicating 
immediate decompression of the patient's left hemithorax (one-half of the patient's chest cavity). Pressing F1 allows 
a review of tests on order. It is currently 16:03; the chest x-ray result will not be available until 16:20 and the 
examinee must decide whether to treat the patient now or wait until x-ray results are available. She decides to 
perform an immediate needle thoracostomy (insertion of a needle into the chest cavity to evacuate the air) and the 
computer simulates the process and results: 

The rush of air confirms the diagnosis, but suddenly another message appears on the screen: "Nurses 
Note: The patient's pain is more severe." More action is required. The examinee orders placement of a chest 
tube; once the patient is stabilized, she orders blood to be drawn and additional medical history to be taken. 
The examination continues until, at 16:37, the examinee completes the workup, admits the patient to the 
ward, and leaves orders for followup procedures. At 16:50 the message appears on the screen: "Thank you 
for taking care of this patient." 

In this example, the simulated case time was 50 minutes; it took the student 17 minutes in real time to complete 
the case simulation. Cases can last for months of simulated time; examinees typically are allowed about 40 minutes, 
but usually take 20 to 25 minutes. 

NBME computer-based testing is being introduced in phases. In Phase I, results from a 1987 field study were 
reviewed by an external advisory panel of experts in medicine, medical education, medical informatics, and 
psychometrics; they concluded the following:² 

• CBX succeeded in measuring a quality (reasonably assumed to be related to clinical competence) not 
measured by existing examination formats. 

• NBME should continue its current level of developmental activity directed at the ultimate use of the CBX 
in the NBME examination sequence for certification. 



1 This example is excerpted from K.E. Cotton and D.M. Durinzi, Computer Based Examination Software System: Phase I-Phase II Update 
(Philadelphia, PA: National Board of Medical Examiners, 1990). 

2 S.G. Clyman and N.A. Orr, "Status Report of the NBME's Computer-Based Testing," Academic Medicine, vol. 65, No. 4, April 1990. 
• Examinations should be delivered through a system that incorporates collaborations with medical schools. 

• A phased approach should be taken: Phase I would [...] 
could familiarize themselves with the format and participate in collaborative research; Phase II would entail 
formal field studies; Phase III would entail extended intramural testing services; Phase IV would entail 
introduction in the certification examination(s). 

For the first phase of testing, the case simulations, an evaluation of each student's management of the case is 
offered in the form of qualitative "case-end feedback," derived from a scoring key developed by interdisciplinary 
committees of expert clinicians. The record of action is preserved by the computer and becomes the basis for 
computer grading of performance. Actions are evaluated in several item categories:³ 

Benefit: considered appropriate and useful in the management of the patient; 

Neutral: representing acceptable actions that do not necessarily differentiate one student from another; 

Risk: not required and may result in morbidity; 

Inappropriate: represent nonharmful actions that are not indicated in the management of the patient; 

Flag: indicate that the student did not successfully fulfill the testing objective or subjected the patient to 
unacceptable risk or poor probable outcome, through errors of omission or commission. 

Additional data provided include itemized charges for services and tests, and a transaction list of actions taken. 
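
The category scheme above can be illustrated with a minimal sketch. It assumes a scoring key is simply a table mapping each order to one of the five categories; the key entries, the `Action` record, and the function names are hypothetical and are not drawn from the NBME's actual key or software.

```python
from collections import Counter
from dataclasses import dataclass

# Category names follow the text; everything else here is invented for illustration.
CATEGORIES = ("benefit", "neutral", "risk", "inappropriate", "flag")


@dataclass(frozen=True)
class Action:
    name: str    # free-text order as entered by the examinee
    minute: int  # simulated minutes since the case began


def categorize(actions, scoring_key):
    """Tally examinee actions by category; orders missing from the key count as neutral."""
    tally = Counter({c: 0 for c in CATEGORIES})
    for action in actions:
        tally[scoring_key.get(action.name, "neutral")] += 1
    return dict(tally)


def omitted_benefits(actions, scoring_key):
    """Benefit actions in the key that were never ordered (errors of omission)."""
    taken = {a.name for a in actions}
    return sorted(n for n, c in scoring_key.items() if c == "benefit" and n not in taken)


# Hypothetical key and transaction list for a single case.
KEY = {
    "chest x-ray": "benefit",
    "needle thoracostomy": "benefit",
    "chest tube": "benefit",
    "head CT": "inappropriate",
}
actions = [
    Action("chest x-ray", 0),
    Action("needle thoracostomy", 3),
    Action("head CT", 10),
]
summary = categorize(actions, KEY)
missing = omitted_benefits(actions, KEY)
```

The recorded `minute` field is not used in this tally; a fuller sketch could also compare the timing of each action against the key, since the text notes that the computer records when every action is taken.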



3 Stephen G. Clyman, M.D., project director for Computer Based Exam, National Board of Medical Examiners, personal communication, 
November 1991. 

Figure 8-A1. CBX Case Computer Screen 

Day 1 (Wed)    Time 16:03    Location: Emergency Department 

Vital signs (MD-recorded)                      Day 1 @ 16:03 
    Pulse rate (supine)        118 beats/min 
    Systolic (supine)          98 mm Hg 
    Diastolic (supine)         58 mm Hg 
    Respiratory rate           32/minute 

Chest/lung examination                         Day 1 @ 16:03 
    Thorax normal. Breath sounds absent on the left. 
    Hyperresonance to percussion on the left. 

Cardiac examination                            Day 1 @ 16:03 
    Heart sounds faint. Radial, brachial, femoral and popliteal 
    pulses weak but equal bilaterally. 

SELECT ANY FUNCTION KEY 

F1-ORDER   F2-H&P   F3-REVIEW   F4-CLOCK   F5-PAUSE   F6-HELP 

SOURCE: K.E. Cotton and D.M. Durinzi, Computer Based Examination Software System: Phase 
I-Phase II Update (Philadelphia, PA: National Board of Medical Examiners, 1990). 




Phase II entails formal field studies addressing [...] 
and its derivative scores for use at the level of clinical [...] 
and a 140-item multiple-choice examination [...] 
in surgery, pediatrics, internal medicine, and obstetrics/gynecology. Separate scores were generated for each 
measure in each discipline for over 1,700 students at 9 [...] 
scoring system that codifies criteria specified by expert clinicians [...] 

The findings to date are as follows:⁴ 

• Student surveys indicated that students believed that CBX simulations were more representative of the 
[...] than were the multiple-choice questions. 

• Reliability of the CBX scores in which there were large samples ranged from 0.70 to 0.80. These findings 
have been consistent across subjects, time, examinee level of training, and machine interface changes. 

• The validity of the scores in this context is supported by multiple studies in which independent [...] 
of average case performance by clinicians show high [...] 

• Correlations between multiple-choice and CBX scores in the same discipline are [...] 
0.50 corrected for the unreliability of the measures). Assuming the CBX scores are valid, as supported by 
the above-mentioned rating studies, this indicated that unique measurement information of merit to the 
evaluation of medical students is provided by both CBX and the multiple-choice [...] 

• Analysis of multiple-choice questions compared the computerized versus paper-and-pencil versions. 
Students were ranked similarly on both versions, although the computerized multiple-choice version [...] 
to be more difficult than the paper-and-pencil version by about 25 standard score points (p<.01), suggesting 
that use of norm data from the paper-and-pencil tests would be inappropriate for the computer-based version. 
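
The phrase "corrected for the unreliability of the measures" in the findings above refers to a standard psychometric adjustment, usually called the correction for attenuation: an observed correlation is divided by the square root of the product of the two measures' reliabilities. The sketch below shows only the arithmetic. The input values are made up so the result is easy to check; they are not the NBME's data, and the function name is an assumption.

```python
import math


def disattenuate(r_observed, reliability_x, reliability_y):
    """Correction for attenuation: r_observed / sqrt(reliability_x * reliability_y)."""
    if not (0 < reliability_x <= 1 and 0 < reliability_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Hypothetical inputs: an observed correlation of 0.375 between two measures
# that each have a reliability of 0.75.
corrected = disattenuate(0.375, 0.75, 0.75)
```

Because reliabilities are at most 1, the corrected value is never smaller in magnitude than the observed one; the less reliable the measures, the larger the adjustment.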

Several other research questions are being addressed. They include:⁵ 

1. Are the CBX scores valid as an interdisciplinary evaluation of senior medical students? 

2. What are effective means for weighting the relative importance of items and defining [...] 

3. How comparable are different sets of simulations in providing equivalent challenges to examinees? 

4. Can simulations be "disguised" and reused without jeopardizing test fairness and meaningfulness of 
scores? 

In addition, the National Council of State Boards of Nursing has taken the CBX model and is in the process of 
adapting it to the model of nursing education, and researching its use for possible certification examination. 



4 Unpublished National Board of Medical Examiners data, cited in National Board of Medical Examiners, Interim Report on CBT Phase 
II (Philadelphia, PA: 1991). 

5 Clyman, op. cit., footnote 3. 



ing of the physics concept because it is less 
confounded with other skills such as reading level.³¹ 

Alternate modes of response can be used on the 
computer. Keyboarding reduces problems in 
interpreting handwriting, and the use of tablets, mouse, 
touch screens, light pens, and voice entry can 
provide new data entry formats. These new sources 
for data input also open doors for testing students 
with physical disabilities who may be unable to use 
traditional paper-and-pencil testing methods. 



CBTs allow for improved standardization of test 
administration. For example, time allowed for any 
given item can be controlled, and instructions to test 
takers are not affected by variations in presentation 
by human examiners. 

Scheduling of CBTs is more flexible, since not all 
students have to be tested at the same time.³² 

CBTs are not affected by measurement error due 
to erasures or stray marks on answer sheets. Young 



31 Wise and Plake, op. cit., footnote 22, p. 6. 

32 See, for example, Gerald Bracey, "Computerized Testing: A Possible Alternative to Paper and Pencil[...]," 
[...], p. 16. 
Figure 8-2. Mean Testing Time for Different Testing Formats 

[Figure not reproduced. Axis: mean testing time (minutes); one series is labeled computer-adaptive testing.] 

SOURCE: James B. Olsen et al., "Comparisons of Paper-Administered, 
Computer-Administered and Computer Adaptive Achievement 
Tests," Journal of Educational Computing Research, vol. 5, No. 
3, 1989, pp. 311-326. 

children, who may have difficulty connecting an 
answer with its associated letter on a separate answer 
sheet, may have less trouble supplying their answer 
directly on the computer. 

Computerized adaptive tests provide greater 
measurement accuracy at all ability levels than either 
CBTs or paper-and-pencil tests,³³ because they can 
more accurately discriminate using fewer items. 
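
The reason an adaptive test can discriminate with fewer items is that each item is chosen to sit near the examinee's current estimated ability, where a right or wrong answer is most informative. The following is a deliberately simplified sketch of that loop, not the algorithm of any system cited in this chapter. It assumes a one-parameter (Rasch) response model, picks the unused item whose difficulty is closest to the current estimate, and nudges the estimate by the difference between the observed and expected result. Operational adaptive tests typically use maximum-likelihood or Bayesian estimation instead; all names and the item pool here are hypothetical.

```python
import math


def p_correct(theta, difficulty):
    """Rasch model: probability that an examinee of ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def next_item(theta, pool, used):
    """Index of the unused item whose difficulty is closest to the current estimate."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return min(candidates, key=lambda i: abs(pool[i] - theta))


def run_cat(pool, answer, n_items=5, step=1.0):
    """Administer n_items adaptively; answer(difficulty) returns True or False."""
    theta, used, log = 0.0, set(), []
    for _ in range(n_items):
        i = next_item(theta, pool, used)
        used.add(i)
        correct = answer(pool[i])
        # Move the estimate up after a surprising success, down after a surprising failure.
        theta += step * ((1.0 if correct else 0.0) - p_correct(theta, pool[i]))
        log.append((pool[i], correct))
    return theta, log


# Hypothetical pool of nine items with difficulties from -2.0 to 2.0.
pool = [x / 2 for x in range(-4, 5)]
# Hypothetical examinee who answers correctly whenever the item is easier than 1.0.
estimate, log = run_cat(pool, answer=lambda difficulty: difficulty < 1.0)
```

Note that the sequence of items depends on every earlier answer, which is why, as discussed under the disadvantages of CBT, changing one response would change all the items that follow it.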

CBTs allow for immediate scoring and reporting; 
responses entered directly on the computer can be 
scored and tabulated in seconds, and scores can be 
reported back to the examinee and the teacher 
virtually instantaneously. Rapid feedback of this 
sort can be particularly important for teachers and 
more useful than paper-and-pencil tests that can 
require 6 weeks or more to be scored. 

CBT allows for greater integration between 
instruction and assessment. Students working 
through lessons on an ILS can be assessed as they 
progress. Assessment can take the form of pauses in 
the instructional sequence during which students 
respond to questions or other prompts; with more 
sophisticated tracking software the assessment can 



take place on a continuous basis, providing 
information to teachers about student strengths and 
weaknesses as they work. 

CBTs can provide more detailed information than 
paper-and-pencil tests. For example, student re- 
sponse time for any or all items can offer clues to 
student strengths and weaknesses; tests equipped 
with this feature can keep track of skipped questions, 
item-response times, and other possibly relevant 
data. This information can be useful to test takers as 
well as teachers. 

CBTs provide a more efficient means to pretest 
new items, which can be inserted unobtrusively into 
any sequence of questions; faulty items can be 
eliminated and the computer can adjust its scoring 
algorithm accordingly.³⁴ 
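
A minimal sketch of the idea in the preceding paragraph, under the assumption that each item carries a status flag: tryout items are presented alongside scored items but do not count toward the score, and an item later judged faulty can be dropped and every examinee rescored. The item identifiers, status labels, and function names are hypothetical.

```python
def score(responses, item_bank):
    """Count correct answers on operational items only.

    responses: {item_id: True if answered correctly}
    item_bank: {item_id: "operational" | "pretest" | "dropped"}
    Returns (raw_score, max_score).
    """
    counted = [i for i in responses if item_bank[i] == "operational"]
    raw = sum(1 for i in counted if responses[i])
    return raw, len(counted)


def pretest_difficulty(all_responses, item_id):
    """Proportion of examinees who answered a tryout item correctly."""
    answers = [r[item_id] for r in all_responses if item_id in r]
    return sum(answers) / len(answers)


# Hypothetical bank: three scored items and one unscored tryout item.
bank = {"q1": "operational", "q2": "operational", "q3": "operational", "x1": "pretest"}
examinee = {"q1": True, "q2": False, "q3": True, "x1": True}
before = score(examinee, bank)   # the tryout item x1 is ignored

bank["q2"] = "dropped"           # q2 is later found to be faulty
after = score(examinee, bank)    # rescored without q2
```

Keeping the status in the item bank rather than in the responses means that no examinee's record has to be altered when an item is dropped; only the scoring pass is rerun.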

CBTs are more secure than paper-and-pencil 
tests. There are no paper copies of tests to be 
misplaced or stolen, items can be presented in mixed 
sequences to different students, and the number of 
items stored in memory is too large for anyone to 
attempt to memorize. Computerized adaptive tests 
have a particular security advantage: each test taker 
gets essentially a unique test. 

Finally, CBTs may offer a set of less tangible 
advantages over paper-and-pencil. Among the 
issues researchers are exploring are: whether 
successful handling of the technology itself raises self 
esteem of students, especially developmental or 
low-ability students; whether rapid feedback 
reduces test anxiety; whether students become less 
frustrated and bored with CBT than with 
paper-and-pencil tests; and whether students are less 
embarrassed when results are given by the computer rather 
than by a teacher. 

Disadvantages of CBT 

CBTs may introduce new kinds of measurement 
error or may introduce new factors that compromise 
the accuracy of the results. For example, results on 
a mathematics or science test could be skewed if 
poor screen resolution interferes with the student's 
decoding of graphs or images; long reading passages 
requiring the examinee to scroll through many 
screens could favor students with ability to 
manipulate computer keys rapidly rather than gauge relative 



33 Bunderson et al., op. cit., footnote 8, p. 385. 
34 Wainer, op. cit., footnote 30. 
reading comprehension proficiency.³⁵ Input devices 
such as a mouse may be difficult for some students 
to operate, and current touch screens may not be 
accurate enough for sophisticated items requiring 
pointing and drawing. These issues suggest also that 
the lack of experience or familiarity with computers 
and keyboarding may put some students at a 
disadvantage compared to others. 

Most CAT software, because of its branching 
algorithms, prevents examinees from reviewing or 
changing an answer without changing all of the 
items following the changed ones. The effects of this 
rigid sequencing on response patterns and cognition 
are not well understood. 

Results of CATs are less obviously comparable to 
one another because each student's test is different 
in both the questions presented and the time allotted 
to finish. This may cause a perception on the part of 
students or others that test scores are somehow not 
a fair basis for comparisons.³⁶ These problems are 
aggravated by the general lack of familiarity with 
CAT on the part of test takers and the general public. 

Ironically, the computer might provide too much 
information: teachers, parents, students, and admin- 
istrators may be unable to digest the large amounts 
of data made available from CBTs.³⁷ 

Reliability and validity of CBT generally and 
CAT specifically are important issues. Some studies 
have found that CAT can achieve reliability as high 
as conventional tests with far fewer items.³⁸ 
However, potential threats to validity and reliability 
warrant careful consideration: for example, issues 
related to content validity, effects of presentation 
mode on construct validity, potential negative ef- 
fects on low-ability examinees, different contexts 
for item presentation, and the uses of data from 
conventional tests to set parameters of CATs. 

Cost Considerations 

Cost factors could pose formidable barriers to 
widespread adoption of CBT. Under current large- 



scale testing arrangements, when masses of students 
are tested at the same time, hardware requirements 
for CBT would be prohibitive. Scheduling students 
to be tested at different times could provide relief 
and would not necessarily create security risks, 
especially if a CAT model is used. But this approach 
would require drastic organizational changes from 
existing testing practice. Nevertheless, it may be 
possible to conduct some large-scale testing activi- 
ties in shared facilities equipped with the appropriate 
testing hardware. Today's college entrance exami- 
nations are not offered in every school, but in 
selected sites on preselected dates; ETS is now 
considering setting up testing sites for administra- 
tion of the GRE and professional certification 
examinations that are supplied with sufficient hard- 
ware to support CBT. These sites could be in schools 
or separate testing centers; in either event, the 
facility would be rented or leased by the test users 
(e.g., a professional association sponsoring certifica- 
tion examinations) for the time required to conduct 
the testing. Schools could adopt this shared facilities 
concept if it were necessary to conduct large-scale 
testing activities during a set time period. 

Test Misuse and Privacy: A Further Caveat 

Fully integrated instruction and assessment, 
hailed by some as the ideal approach to student 
testing, raises important questions related to test 
misuse and privacy. In a word, when testing is more 
closely linked to instruction it may become increas- 
ingly difficult if not impossible to prevent test 
results from being used inappropriately. It is 
precisely the tremendous recordkeeping and 
administrative efficiencies of CBT that pose this threat. To 
illustrate this concern, consider the ethical dilemmas 
that arise if students do not know they are being 
tested: as long as the information is used solely as 
feedback to teachers and students to improve learn- 
ing, then there would be little objection. But if the 
results are used in high-stakes decisions such as 
graduation from grade school or placement into 
special classes (e.g., gifted or remedial) or made 



35 Research has shown that most people read 30 to 50 percent slower from a computer screen than from paper. Until screen resolution is improved 
significantly (e.g., 2,000 by 2,000 lines of resolution), this problem may not be resolved. Chris Dede, George Mason University, personal communication, 
Sept. 3, 1991. 

36 Green et al., op. cit., footnote 25. 

37 Olsen et al., op. cit., footnote 20, argue that too much information was provided to teachers on each child in the Texas pilot study. The solution was 
finally to print one page of analysis for each child accompanied by an order form for the teacher wanting additional information. 

38 For example, a study of the CAT version of the Armed Services Vocational Aptitude Battery found that the alternate forms reliability coefficient 
for a 15-item CAT was equivalent to that of a 25-item conventional test. Similar findings have been found in other studies. Wise and Plake, 
op. cit., footnote 22, p. 8. 
available to districts and States for accountability 
measures, the concept of seamless integration of 
instruction and assessment becomes less obviously 
attractive. And, in addition to the ethical problems of 
using data derived from tests that students did not 
know were tests, there is also the danger that in the 
long run students (and teachers) will figure out how 
their test results are being used, which would lead to 
distortions in test-taking practice and teaching. 
"Teaching to the test" and other unintended effects 
of high-stakes testing (see also ch. 2) could 
undermine the value of integrated teaching and testing. 

Other New Tools for Testing: Video, 
Optical Storage, Multimedia 

Video technologies are the newest tools of 
instruction. The near ubiquity of videocassette 
recorders (VCRs) in schools makes the use of video more 
feasible for testing as well.³⁹ Furthermore, 
videodiscs and digital video interactive also offer new 
possibilities for integrating video capabilities in 
item presentation for more realistic kinds of tasks. 
Often new technologies are combined with older 
formats for innovative testing arrangements. In the 
Oregon Statewide Assessment test of listening 
skills, for example, prerecorded videotapes set the 
scene for questions, which are presented on 
traditional paper-and-pencil multiple-choice tests. 
Developers believe that the visual stimuli presented on 
the tape are more realistic and better than having 
questions read aloud from text. The system was first 
used as an element of the statewide assessment in the 
spring of 1991.⁴⁰ 

A more sophisticated optical storage device now 
also coming into use in some schools is the 
videodisc: a large silver platter (resembling a 
long-playing record) that uses analog technology to 
store text, data, sound, still images, and video. 
Computer branching algorithms can be used to 
manage and sequence the vast amounts of informa- 
tion stored on videodisc; this coupling of optical 
storage and computing technology has already 
resulted in some powerful instructional applications, 
either in the form of enrichment materials or for 
courseware, some of which contain built-in testing 
and evaluation components. Researchers in this field 
anticipate new testing applications of videodisc in 
the future, given the capacity of the technology to 
store large amounts of multimedia items and 
integrate them with testing programs residing in the 
computer. Roughly one-fifth of American schools 
already own videodisc players.⁴¹ 

An application of videodisc to certification testing 
is the prototype developed by ETS to assess teaching 
and classroom management skills as part of the new 
National Teachers Examination. The experimental 
program presents filmed dramatizations of 
classroom management problems that typically occur in 
an elementary school classroom, and prompts the 
viewer to respond to each vignette. For example, 
after watching a scene the viewer may be asked to 
choose the teacher's next course of action; the choice 
activates a branch in the computer algorithm and 
displays the consequences of the choice. 
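
The branching behavior described above can be pictured as a small graph in which each scene lists the choices it offers and the scene each choice leads to. This is only an illustrative sketch; the scene names, prompts, and consequences below are invented and are not taken from the ETS prototype.

```python
# Hypothetical vignette graph; every scene and prompt here is invented for illustration.
SCENES = {
    "opening": {
        "prompt": "Two students begin arguing during a lesson.",
        "choices": {"a": "speak_privately", "b": "ignore"},
    },
    "speak_privately": {
        "prompt": "The argument stops and the lesson resumes.",
        "choices": {},
    },
    "ignore": {
        "prompt": "The argument grows louder and other students join in.",
        "choices": {"a": "speak_privately"},
    },
}


def play(scenes, start, picks):
    """Follow the viewer's picks and return the scenes shown, in order."""
    shown, current = [start], start
    for pick in picks:
        choices = scenes[current]["choices"]
        if pick not in choices:  # an invalid key leaves the viewer on the same scene
            continue
        current = choices[pick]
        shown.append(current)
    return shown


path = play(SCENES, "opening", ["b", "a"])
```

The list returned by `play` doubles as a record of the viewer's decisions, which is the raw material an assessment built on such a program would need to evaluate.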

Cost Considerations 

As with many other instructional technologies, 
high costs of software development coupled with 
uncertainty and fragmentation on the demand side 
have slowed the development of innovative 
applications. However, if videodisc technology becomes a 
more common instructional tool in classrooms, 
software developers will face better prospects for 
return on their development investments. Without 
some sort of public intervention, it is unlikely the 
private market will produce the kinds of videodisc or 
other high-end technological innovations that could 
make a real difference in schools.⁴² There is already 
some evidence that State education policies could 
stimulate growth in this market. For example, the 
decision of the Texas Board of Education to allow 
videodisc purchases with textbook funds is expected 
to lead to increased videodisc use in Texas schools, 
and, because of the large percentage of the school 
market that Texas represents, this policy is likely to 
spur increased videodisc development and use.⁴³ 



39 As of the 1990-91 school year, 94 percent of all schools have one or more videocassette recorders. Quality Education Data, op. cit., footnote 17, 
p. T-8. 

40 Evelyn Brzezinski, Interwest (Oregon), personal communication, Jan. 3, 1991. 
41 Quality Education Data, op. cit., footnote 17, p. T-10. 

42 For analysis of the instructional software market and discussion of public policy options see Office of Technology Assessment, op. cit., footnote 
16, especially ch. 4. 

43 Peter West, "Tex. Videodisc Vote Called Boon to Electronic Media," Education Week, vol. 10, No. 13, Nov. 28, 1990, p. 5. 
New Models of Assessment and the 
Role of Technology⁴⁴ 

Most current uses of computer and information 
technology in large-scale testing make the 
conventional test format faster and more efficient than 
paper-and-pencil methods. The computer 
technologies have not, to date, created real alternatives to 
standardized multiple-choice tests.⁴⁵ Rather, the 
focus of computer applications has been on the 
familiar psychometric model, with enhancements 
that adapt the number, order, difficulty, and/or 
content of standard assessment items to the 
responses.⁴⁶ 

There are two possible consequences that may 
spring from this replication. First, such a concentra- 
tion may reinforce existing test and item formats by 
disguising them in the trappings of modern technology, 
creating a superficial air of advancement and 
sophistication. Moreover, these technical advances 
could make it even harder to break the mold of 
current testing practices, ignoring advances in test 
theory. 

Using Information Technologies 
to Model Learning 

How could computers and computer-related 
information technologies make possible 
enhancements to the current models of testing? How could 
these technologies be applied toward assessments of 
a broader range of human ability, cognition, and 
performance? Recent developments in cognitive 
psychology point to fruitful avenues for research and 
development (R&D). 

First, human cognition and learning are now seen 
as constructive processes: seeing, hearing, and 
remembering are themselves acts of construction. 
Learners are viewed not as blank slates, passively 
recording and recalling bits of information, but as 
active participants who use the fragmentary cues 
permitted them by each of their senses to construct, 
verify, and modify their own mental models of the 
outside world. 

Assessment procedures consistent with this view 
of cognition as an active, constructive activity are 
not limited to simply judging responses as correct or 
incorrect, but take into account the levels and types 
of understanding that a student has attained. 
Imaginative new types of test items are required to 
accomplish these ends, along with new techniques 
for scoring items that permit construction of 
dynamic models of the levels and types of learner 
understanding. Most if not all of these new 
techniques will require the use of computers. This work 
could lead to measures of human cognition and 
performance that are at present only dimly 
perceived, because of limited access and inexperience 
in measuring them.⁴⁷ 

Second, some research on cognition holds that all 
learning is situated within "webs of distributed 
knowledge."⁴⁸ Cognitive performances in real- 
world settings are supported by other people and 
knowledge-extending artifacts (e.g., computers, cal- 
culators, texts, and so forth). This concept chal- 
lenges traditional views of how to determine stu- 
dents' competence. If knowledge is tied in complex 
ways to situations of use and communities of 
knowers, then lists or matrices of abstracted con- 
cepts, facts, procedures, or ideas are not adequate 
descriptors of competence. Achievement needs to be 
determined by performances or products that inter- 
pret, apply, and make use of knowledge in situations. 
It follows from this view that estimates of learner 
competencies are inadequate if they are abstract or 
without context. 

Computer-related technologies may be able to 
help integrate what is known about how children 
learn into new methods of assessment. This could 
include: diagnosing individualized and adaptive 
learning; requiring repeated practice and 
performance on complex tasks and on varying problems, 
with immediate feedback; recording and scoring 
multiple aspects of competence; and maintaining an 


44 Much of this discussion is based on Bank Street College, Center for Children and Technology, "Applications in Educational Assessment: Future 
Technologies," OTA contractor report, 1990. 

45 Walter Haney and George Madaus, "Searching for Alternatives to Standardized Tests: Whys, Whats, and Whithers," Kappan, vol. 70, No. 9, May 
1989, p. 686. 

46 Dexter Fletcher, Institute for Defense Analyses, "Military Research and Development in Assessment Technology," unpublished report prepared 
for OTA, May 1991. 
47 Ibid., p. A-2. 
48 Bank Street College, op. cit., footnote 44. 
efficient, detailed, and continuous history of 
performances. There are four specific areas in which 
computer technology has begun to demonstrate the 
potential for significant enrichments to assessment. 

Tracking Thinking Processes 

Computers enable certain kinds of process records 
to be kept about students' work on complex tasks as 
the work evolves and is revised. They allow the 
efficient capturing of views of students' 
problem-solving performances that would otherwise be 
invisible, evanescent, or cumbersome to record. For 
example, it is possible to keep records of whether 
students systematically control variables when 
testing a hypothesis, to look at their metacognitive 
strategies, to determine what they do when they are 
stuck, how long they pursue dead ends, and so 
forth.⁴⁹ 
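
As a rough illustration of such a process record, the sketch below keeps a timestamped list of events and derives one of the measures mentioned above: how long a student spends on approaches that are later abandoned. The event kinds ("start", "abandon", "solve"), the approach labels, and the class itself are assumptions made for this example, not features of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessRecord:
    """Timestamped log of a student's work on one task (hypothetical structure)."""

    events: list = field(default_factory=list)  # (seconds, kind, approach)

    def log(self, seconds, kind, approach):
        self.events.append((seconds, kind, approach))

    def dead_end_seconds(self):
        """Total time between starting an approach and abandoning it."""
        started, total = {}, 0
        for seconds, kind, approach in self.events:
            if kind == "start":
                started[approach] = seconds
            elif kind == "abandon" and approach in started:
                total += seconds - started.pop(approach)
        return total


# Hypothetical session: one abandoned approach, then a successful one.
record = ProcessRecord()
record.log(0, "start", "guess-and-check")
record.log(95, "abandon", "guess-and-check")
record.log(95, "start", "control-variables")
record.log(240, "solve", "control-variables")
wasted = record.dead_end_seconds()
```

Because the raw events are kept rather than a single summary score, other measures (such as the order in which strategies were tried) can be computed later from the same record.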

Learning With Immediate Feedback 

Because students can be put into novel learning 
environments where the feedback is systematically 
controlled by the computer, it is possible to assess 
how well or how fast different students learn in such 
environments, how they use feedback, and how 
efficiently they revise. 

Structuring and Constraining Complex Tasks 

Computer environments can structure and con- 
strain students' work on complex tasks in ways that 
are otherwise difficult to achieve. In simulations, 
dynamic problems that may have multiple outcomes 
can be designed, and student progress toward 
solutions can be automatically recorded, including 
time, strategy, use of resources, and the like. The 
tasks can be designed to record students' abilities to 
deal with realistic situations, like running a bank, 
repairing broken equipment, or solving practical 
problems that use mathematics. They can show how 
students sift, interpret, and apply information pro- 
vided in the computer scenarios, making it possible 
to measure students' abilities in understanding 
situations, integrating information from different sources, 
and reacting appropriately in real time. 



Using Models of Expertise 

In more advanced assessment systems, models of 
expertise can be programmed and used to guide and 
gauge students' development of understanding in a 
subject area or domain. In this case, learning and its 
monitoring occur simultaneously as the expert 
system diagnoses the student's level of competence. 
This makes it possible to record the problem-solving 
process and compare the student's process with that 
of experts in the field. 

Hardware and Software 

Many types of hardware and software 
configurations apply to these concepts of assessment. 
Telecommunications, for example, is an important tool 
for sharing information about alternative assessment 
tasks. Vermont is using a computer network to share 
information on student portfolios that are now used 
for statewide accountability in mathematics and 
writing. Teachers will be able to share examples of 
work to help develop common standards of grading 
the portfolios, as well as to discuss teaching 
strategies and other concerns over the statewide electronic 
bulletin board.⁵⁰ As shown in box 8-B, another 
example is the use of technology in support of the 
demonstrations of mastery ("exhibitions") required 
of students in the Coalition of Essential Schools (see 
also ch. 6). 

There are many examples of attempts to adapt generic software tools
to assessment: word processors, database software, spreadsheets, and
mathematics programs for statistical reasoning. These tools can be
modified in order to record information in a sequence of work sessions
and provide snapshots of students' processes in solving a problem or
task. A word processor can record the stages of development of an
essay; a spreadsheet program can record the steps taken in the
solution of a multistage problem in mathematics. Because
technology-based environments support accumulation and revision of
products over time, they are well suited to portfolio methods of
assessment (see also ch. 6).
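
The idea of recording snapshots across work sessions can be illustrated with a minimal sketch. The Python fragment below keeps each saved draft and reports how many lines changed between consecutive drafts. The names `Snapshot`, `WorkLog`, `save`, and `revision_sizes` are hypothetical, chosen only for illustration; they do not describe any product discussed in this report.

```python
import difflib
from dataclasses import dataclass, field
from typing import List


@dataclass
class Snapshot:
    session: int  # work session in which the draft was saved
    text: str     # full text of the draft at that point


@dataclass
class WorkLog:
    """Hypothetical record of a student's drafts over several sessions."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def save(self, session: int, text: str) -> None:
        self.snapshots.append(Snapshot(session, text))

    def revision_sizes(self) -> List[int]:
        """Count lines added or removed between consecutive snapshots."""
        sizes = []
        for before, after in zip(self.snapshots, self.snapshots[1:]):
            diff = difflib.ndiff(before.text.splitlines(),
                                 after.text.splitlines())
            # ndiff marks added lines with "+ " and removed lines with "- ".
            sizes.append(sum(1 for line in diff
                             if line.startswith(("+ ", "- "))))
        return sizes
```

A teacher reviewing such a log would see not only the final essay but also where the larger revisions occurred, which is the kind of process evidence a portfolio is meant to preserve.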

As teachers use these tools in teaching, it is appropriate that they
be employed in testing situations as well. For example, when writing
is taught as a process using a word processor, students develop



This represents an extension of basic concepts such as the "audit trail," already in use in some instructional software, to assessment. For discussion
of intelligent tutoring and related concepts, see Office of Technology Assessment, op. cit., footnote 16, ch. 7.

^ *teuiy Miller, New England Telephone, personal communication, September 1991.
Box 8-B—The IBM/Coalition of Essential Schools Project: Technology in Support of "Exhibitions of Mastery"¹

"Planning backwards" is the term for how schools in the Coalition of Essential Schools determine what
knowledge they want their students to possess, and what skills they want them to be able to demonstrate when they
graduate. At Sullivan High School in Chicago, every member of the school community reads and participates in
seminars discussing the works of great men and women, from Aristotle to Martin Luther King, in order to
demonstrate their abilities to analyze and interpret works of original text. Seniors at Walbrook High School in
Baltimore spend 1 year researching a specific question like "Is the city water safe to drink?" and must present
findings, answer questions, and defend their positions before a panel of teachers and students, much like a Ph.D.
student defending a dissertation. At Thayer High School in Winchester, New Hampshire, the faculty work in teams
of four with a group of students for 3 1/2 hours each day on a set of interdisciplinary "essential questions" chosen
by the teaching team, allowing the students to show the connections among multiple disciplines.

These new teaching approaches require new assessment approaches. What is perhaps unique is how technology
is being considered from the start as a tool for facilitating the restructuring that such "planning backwards" requires.
IBM has committed $900,000 to the Coalition project at Brown University, along with equipment and technical

¹Material for this box is from The Brown University News Bureau, "IBM and Brown University Select Five High Schools for National
'Exhibitions of Mastery' Project," news release, June 26, 1991, and David Niguidula, Coalition for Essential Schools, Providence, RI, personal
communication, December 1991.

Figure 8-B1—Menu for Coalition of Essential Schools' Exit-Level Exhibitions

[Figure: menu screen titled "Exit-Level Exhibitions," organized into six phases. Phase 1, Vision: What should a
graduate of this school look like? Phase 2, Exhibition: What is the call for exhibition? Phase 3, Setting: What
logistics surround the exhibition? Phase 4, Samples: What does student work look like? Phase 5, Standards: How is
student work assessed? Phase 6, Reflections: Particular strengths? Unforeseen problems? Other thoughts? The
screen lists four school entries, among them Hodgson Vo-Tech High School (Senior Project) and Sullivan High
School (Humanities), with the prompts "Click on a school name to see its Exit Level Exhibitions," "Click here for
Course Level Exhibitions," and "Click here to leave the program."]

SOURCE: Coalition for Essential Schools, Brown University, Providence, RI.
support, to work with these schools and two others (Eastern High School, Louisville, and English High School,
Boston) to examine how technology can facilitate the planning, development, and evaluation of the "exhibitions
of mastery" assessment procedures at these schools. Technology is expected to be used in the following ways:

• Research: CD-ROM, videodiscs, computer databases, and telecommunications will be used for accessing
and keeping track of information the teachers need for their teaching and the students need for their
exhibitions.

• Student-Teacher Communications: Electronic mail will make it possible for information to be shared
between students and their teachers both within a school and among sister exhibition schools. Project
management will be tracked on the computer networks, and file transfers will be made so teachers can
"red-pen" student drafts in progress.

• Performances: Tools such as word processing, desktop publishing, and multimedia will be used for creating
student products.

• Assessment: Electronic portfolios of work in progress and records of student activity throughout the
exhibition process will be created. Telecommunications will be used for assessing exhibitions within and
among schools.

An electronic exhibitions resource center has been established by the 110+ member schools of the Coalition
for Essential Schools. They are all contributing to this library of practical ideas, methods, and materials, which will
be available on-line to help Coalition member schools create their own exhibitions. The exhibition resource center
will provide a forum for discussing exhibitions and receiving updated information (see figures 8-B1 and 8-B2).

Figure 8-B2—Sample Screen When "Visions" is Selected From Menu

[Figure: sample screen headed "VISION: What should a graduate of this school know?" for the Senior Institute of
Central Park East Secondary School. The screen text reads: "Graduating seniors will know that they have
produced quality work in a broad range of intellectual areas. Their graduation, then, will be a meaningful
celebration of achievement, not a perfunctory passage. These students will leave this school confident that they
have developed the 'habits of mind' necessary to meet the challenges of the world into which they enter. These
'habits' translate into a series of questions that should be applied to all learning experiences: 1. How do we know
what we know? What is the evidence? Is it credible? 2. What viewpoint are we hearing, seeing, ..." Buttons
beside the text select Phase 1 (Vision) through Phase 6 (Reflections), and a menu button returns to the list of
schools.]

SOURCE: Coalition for Essential Schools, Brown University, Providence, RI; example from Central Park East Secondary
School, New York, NY.
the skills of freewriting, drafting ideas, writing a draft, revising,
moving ideas around, editing, using all the tools of creation and
revision provided by today's word processing software. To then test
these writing skills using a paper-and-pencil examination would be as
inappropriate as teaching a pilot to fly a jet and then testing his
skills in a hang glider. Similarly, students taught to use calculators
as mathematical tools should be tested on their ability to use these
tools to carry out mathematical calculations.

The tests under development for certifying architects provide an
interesting example of how advanced tools available on computers can
enrich test design and scoring. Examinees use the computer tools that
allow them to draw, measure, calculate, change the size and scale of
objects, and extract information from databases embedded within the
testing software (see box 8-C).

Another category of software includes simulations and modeling
programs that create highly realistic problem-solving contexts.
Examples can be found in most domains, both in and out of school, and
are available for computers in the schools. They enable students to
observe, control, and make decisions about scientific phenomena that
would otherwise be difficult or impossible to observe. For example,
with Physics Explorer, students can conduct and observe a series of
experiments that simulate the behavior of objects and phenomena under
different conditions. For example, a student can compare the upward
acceleration of an object under different conditions of gravity. The
assessment includes onscreen records of various experiments that are
conducted; printouts of steps taken by the student in the form of note
cards, experimental parameters, and sequences of decisions; and video
recordings of students interacting with software and explaining their
work. Scoring is based on understanding of interactions among
parameters, appropriateness of experiments conducted, systematic
approach to testing of variables, use of different information
sources, nature of predictions and hypotheses, interpretation of
experiments, and quality of group collaboration.
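
One of the scoring criteria above, a systematic approach to testing of variables, lends itself to a small illustration. The Python sketch below assumes a simple model that is not taken from Physics Explorer: an object of mass m pushed upward by a constant force F has net upward acceleration a = F/m - g. It then measures how often a student changed exactly one parameter between consecutive trials. The parameter names (`force_n`, `mass_kg`, `gravity`) and both function names are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, float]


def upward_acceleration(p: Params) -> float:
    """Net upward acceleration (m/s^2) under a constant upward force."""
    return p["force_n"] / p["mass_kg"] - p["gravity"]


def changed_keys(a: Params, b: Params) -> List[str]:
    """Names of the parameters whose values differ between two trials."""
    return sorted(k for k in a if a[k] != b[k])


def systematic_fraction(trials: List[Params]) -> float:
    """Share of consecutive trial pairs that vary exactly one parameter."""
    pairs = list(zip(trials, trials[1:]))
    if not pairs:
        return 0.0
    single = sum(1 for a, b in pairs if len(changed_keys(a, b)) == 1)
    return single / len(pairs)
```

For instance, a student who first changes only gravity (from 9.81 m/s^2 on Earth to 1.62 m/s^2 on the Moon) and then changes force and mass together would have one of two steps counted as systematic. A real scoring scheme would weigh this alongside the other criteria listed above.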

Other computer simulations enable students to carry out complex
actions by simulating decisionmaking activity in the sciences, social
science, history, and literature. For example, Rescue Mission is a
simulation that allows elementary school students to navigate a ship
to rescue a whale trapped in a net by learning the mathematics and
science required to read charts, plot a course, and control
navigation instruments.

Photo credit: MECC

Wagon Train 1848, created by MECC, is an example of an
educational simulation program.

One of the most promising aspects of simulation software for
education is the fact that this software is already in use and popular
in schools today, and can be supported on relatively inexpensive
computers. Simulation and modeling programs can provide multiple
complex tasks and record how students go about solving them. They
provide opportunities for assessing students' skill in such
problem-solving activities as formulating the relationships between
variables, troubleshooting or diagnosing problems, and integrating
multiple types of information in decisionmaking.

Video and multimedia systems are a third category of technology with
applications to new concepts of student assessment. VCRs can record
the interactions of students in groups, and the ways they use aspects
of their social and physical environment in accomplishing tasks.
Video technologies can record continuing activities, products at
various stages of development, explanations, and presentations in rich
detail. The video record can be analyzed in minute detail over time,
much as one would review a written record of performance.

Bank Street College, op. cit., footnote 44.

The electronic integration of different media (video, graphics, text,
sound) has made possible new multimedia opportunities for
instructional environments and new, but relatively unexplored
opportunities for assessment. These developments allow multiple forms
of media to be stored and orchestrated on a single disk, simplifying
their use.

Although the technology for some of these projects is currently too
expensive for average classroom use, costs are expected to drop as
more powerful computers enter classrooms. Some schools have begun to
experiment with multimedia applications. The Jasper Woodbury Series,
for example, presents a story through dramatic video segments, and
enlists the student in solving problems using information provided
through multiple linked databases (see box 8-D). Jasper, which is
still in R&D, is being integrated into the science and mathematics
programs in a number of schools that have expressed their willingness
to experiment.

Performance assessments often call for student-created productions or
projects over time as a basis for evaluation, and multimedia systems
can provide rich composition tools to meet this goal. In some systems,
students can make use of the information (in graphic, text, or video
formats) available within a multimedia system as they compose their
own projects or productions. This makes new kinds of student products
available for assessment purposes. Since students create these
productions from within these "closed" systems, traces of their
creative composition process in choosing and composing information can
be recorded.

Photo credit: IBM Corp.

Ulysses, created for IBM Corp. by And Communications Inc., is an
example of an advanced interactive educational program combining
video, graphics, text, and sound.

Finally, intelligent tutoring systems (ITSs), originally conceived as
instructional systems, have recently begun to be adapted to
assessment. ITSs are based on principles of artificial intelligence
and expert systems. They combine models of what constitutes expertise
within a field or domain with models of the learners' own technique,
diagnosing, evaluating, and guiding student performance compared to
expert performance. Responses of students throughout the learning
process can be aggregated and interpreted in relation to
representation of expert problem solving. The systems offer the
opportunity to understand student performance not simply in terms of
correct answers, but in sequences of responses that can reveal how a
student learns.
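
Comparing a student's sequence of responses with an expert's can be sketched in a few lines, with the caveat that actual intelligent tutoring systems rely on far richer models of expertise than this. The Python fragment below uses the standard library's `difflib.SequenceMatcher` to align two lists of step labels. The function name `compare_to_expert` and the step labels in the usage note are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def compare_to_expert(student: List[str],
                      expert: List[str]) -> Tuple[float, List[str], List[str]]:
    """Return (similarity, expert steps the student skipped,
    student steps absent from the expert sequence)."""
    matcher = SequenceMatcher(None, expert, student, autojunk=False)
    missing: List[str] = []
    extra: List[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing.extend(expert[i1:i2])   # present only in expert sequence
        if tag in ("insert", "replace"):
            extra.extend(student[j1:j2])    # present only in student sequence
    return matcher.ratio(), missing, extra
```

Given an expert sequence of "read problem," "list knowns," "choose formula," "compute," "check units" and a student sequence of "read problem," "choose formula," "compute," "guess," the function reports the two skipped steps and the one unmatched step, which says more about the student's process than a right-or-wrong mark on the final answer.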

There are very few ITSs available today and their 
focus is typically on instruction, not assessment. 
They are extremely expensive to develop and require 
a higher level of computer technology than most 
schools own. The few in place cover circumscribed 
parts of the curriculum, and concentrate on the
domains where computational power has the most 
leverage and where skills and content are more 
narrowly defined (e.g., science, mathematics, and 
computer science). It is unclear how feasible they 
would be in other areas that are more open-ended, 
such as history or literature. 



The digital video interactive product Palenque, which allows users to "explore" the Mayan archaeological site via computer and screen, and to
consult a variety of visual databases to gather additional data along the way, requires a hardware/software system costing approximately $20,000. It is
currently being used in several science museums around the United States. See ibid., p. 26; and Office of Technology Assessment, op. cit., footnote 16.

Jasper and other similar systems attempt to capitalize on students' ever-increasing familiarity (and comfort) with television and video, and promote
the development of their skills in analyzing and using information provided via video format.

"Artificial intelligence asks the questions: What is the fundamental nature of intelligence and how can we make computers do the things that we
consider intelligent? ... An expert system is an automated consultant. Given a problem, it requests data relevant to the solution. After analyzing the
problem, it presents a solution and explains its reasoning. Expert systems are relevant to education because they can represent problem-solving expertise
and explain to students how to use it." See Henry M. Halff, "Instructional Applications of Artificial Intelligence," Educational Leadership, March 1986.
Box 8-C—Computer Technology for Professional Certification Testing:
National Council of Architectural Registration Boards

It is not surprising that perhaps the most ambitious research on the use of computer technology for professional
certification examinations is found in the field of architecture: architects often look for creative solutions and new
ways to solve problems, using the most advanced technologies. At the same time, because only one-half the States
require architects to have a college education and only 60 percent of the candidates who sit for the architectural
boards have a professional degree in architecture, the examination has traditionally played an important gatekeeping
role, i.e., assuring that candidates who receive national certification meet high standards of skills and knowledge.
Furthermore, since the number of candidates who seek certification is relatively small (each year only 4,300
candidates begin the examination process), field testing is more manageable than in other professions.¹ Several other
professional groups are following this research with great interest before developing their own technology-based
testing for professional certification.

Since 1965, all architecture candidates have been required to pass a multipart uniform paper-and-pencil
national examination developed by the National Council of Architectural Registration Boards (NCARB). This
examination, which has been revised periodically based on task analyses of the profession, currently consists of nine
parts, seven of which are traditional multiple-choice tests of discrete knowledge in various architectural fields. Two
sections require candidates to draw solutions to design vignettes; one section involves solving six discrete site
design problems, while the other entails a comprehensive building design. These sections are scored by juries of
practicing architects, similar in process to the scoring of Advanced Placement examinations (see ch. 6).

Since 1985, NCARB has been working with the Educational Testing Service (ETS) in a joint research project
to develop computer-administered examinations. The first phase of the research entailed converting four of the
seven multiple-choice sections to Computer Mastery Tests.²

The computer mastery model uses item-response theory to select questions from the full item bank,
reorganizing them into "testlets," each of which provides a collection of questions that offers precise
measurement of a candidate's ability. The items within a testlet are presented on the computer. When the candidate
answers enough questions to determine that a passing or failing score has been achieved, testing ceases. If the
outcome is unclear, more questions are presented until a clear pass-fail determination has been made. The computer
mastery tests were pilot tested between 1988 and 1990. They successfully met the desired psychometric standards;
the computerized tests achieved the same or better accuracy of measurement at the pass-fail point as that provided
by the current tests, using as few as one-third as many test items as are needed in the paper-and-pencil version.
However, because the computer tests were offered as an option to paper-and-pencil testing but were more expensive
($75 per subject as compared to $35 per subject for the paper-and-pencil format), not enough candidates opted for
the computer version to make it economically feasible. Since 1990, only the paper-and-pencil version has been
offered.
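
The stopping rule described above (stop as soon as a pass or fail is clear, otherwise present more questions) can be sketched as follows. This is a simplified illustration, not the NCARB/ETS procedure: it assumes each testlet is scored as a number correct and that cumulative cut scores for each stage have already been set by some psychometric method. The names `Stage` and `mastery_decision` and all threshold values in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Stage:
    fail_at_or_below: int  # cumulative score that ends testing with a fail
    pass_at_or_above: int  # cumulative score that ends testing with a pass


def mastery_decision(testlet_scores: Sequence[int],
                     stages: Sequence[Stage]) -> Tuple[str, int]:
    """Return (decision, number of testlets used).

    After each testlet the cumulative number correct is compared with
    that stage's cut scores; testing stops at the first clear outcome.
    """
    cumulative = 0
    for used, (score, stage) in enumerate(zip(testlet_scores, stages), start=1):
        cumulative += score
        if cumulative >= stage.pass_at_or_above:
            return "pass", used
        if cumulative <= stage.fail_at_or_below:
            return "fail", used
    # Reached only if the data run out before a decision; the final
    # stage should set fail_at_or_below = pass_at_or_above - 1.
    raise ValueError("no decision reached with the scores provided")
```

With three 10-item testlets and illustrative stages `Stage(4, 9)`, `Stage(10, 17)`, and `Stage(17, 18)`, a candidate scoring 9 on the first testlet passes after one testlet, while a candidate scoring 6, 5, and 7 needs all three. The saving in test length comes from candidates whose outcome is clear early.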

NCARB plans to switch over to computer-administered testing for all seven of the discrete knowledge sections
in 1997, dropping the paper-and-pencil option altogether. At that point, the second research activity will also be put
into place. This project involves administering the test (examinees use a mouse or other pointing or drawing device
to design directly on the computer screen), and scoring the discrete site design vignettes directly on the computer.
Field testing has shown that design problems that take an average of 20 minutes on the paper-and-pencil version
require only 5 minutes to complete on the computer, because of the ease of erasing, redrawing, and adjusting
drawings. As a result of this research, NCARB expects to be able to present candidates with up to 15 vignettes to
solve, compared to the existing 6, in the same period of time (see figure 8-C1).

Finally, the comprehensive building design problem is being converted to a computer-administered
examination as well. In this case, each candidate will use two computers, one which presents and serves as the
"answer sheet" for a candidate's design solutions to comprehensive, multistep design problems; the other monitor
provides the "model architect's office," containing all the design tools, resources, and reference manuals needed
¹Jeffrey F. Kenney, director of Examinations Development, National Council of Architectural Registration Boards, personal
communication, October 1991.

²"Breakthrough Development in Computerized Testing Offers Shorter Test, More Precise Pass-Fail Decisions," ETS Developments, vol.
33, Nos. 3 and 4, winter/spring 1988, pp. 3-4.
Figure 8-C1—Example of an NCARB Site Design Vignette

A recreational center plan must accommodate a club house in its present position, as well as
tennis courts, pool, bleachers, and a service building. Prepare the site plan according to the
following objectives. (1) Preserve all trees. (2) Bleachers shall serve the tennis courts. (3) Pool shall
be adjacent to the clubhouse. (4) Service building shall relate to the club house and the parking lot.
to complete each task. Each of the substeps in the comprehensive design problems will be presented as separate
sections and scored separately. For example, a candidate may be asked to design a library that meets certain site
and client requirements. In the first step, the "bubble diagrams" that ...

A second section would require taking a block diagram and relating it to the site requirements in terms of light,
ground contours, zoning, and other constraints. Each of these individual predesign tasks will be scored separately,
making it possible to give a candidate partial credit, instead of scoring the building problem as a whole, as is done
in the existing paper-and-pencil format.
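
As a rough illustration of partial credit, the Python sketch below combines separately scored subtasks into a single proportion of credit earned. It assumes each subtask receives a score between 0 and 1 and a weight reflecting its importance; the function name `partial_credit`, the weights, and the subtask label "site constraints" are hypothetical and are not drawn from NCARB's scoring rules.

```python
from typing import Dict


def partial_credit(subtask_scores: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Weighted share of credit earned across subtasks (0.0 to 1.0).

    Subtasks listed in `weights` but missing from `subtask_scores`
    earn no credit.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    earned = sum(weight * subtask_scores.get(name, 0.0)
                 for name, weight in weights.items())
    return earned / total_weight
```

A candidate who completes the bubble diagram fully, the block diagram halfway, and does not attempt a double-weighted third subtask would earn 0.375 of the available credit under these assumed weights, where whole-problem scoring would record only a single overall judgment.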

In 1989, six different item types were developed for the simulations and pilot tested at an architectural firm
and the NCARB annual meeting. It is anticipated that the computer format will permit more reliable assessment of
candidates' abilities. Whereas it now takes candidates up to 12 hours to complete 1 comprehensive problem
(typically 4 hours to come up with a design and another 8 to put it down on paper), using the computer simulations
broken down into subtasks, up to 10 design samples can be presented over a period of 5 to 6 hours. It is anticipated
that perhaps 3 comprehensive building problems, with a total of some 20 to 30 subtasks, can be administered for
this portion of the examination over the same time period, giving the State board examination a fuller and more
reliable picture of an architect's design skill and ability to meet the necessary health, safety, and environmental
standards.

Researchers are encouraged by the progress made in the design of the computer interfaces; indeed, erasers,
drafting tools, measuring tapes, calculators, and other design tools that make it possible to move and adjust drawings
are available in many computers today, as are the appropriate data storage and retrieval capabilities. The hardware
required is Windows-based 386 machines with approximately 4 megabytes of memory. Advances in object-oriented




programming make it possible to use icons for frequently used architectural components (e.g., corridors, doors,
walls, and windows). It is the development of scoring psychometrics that poses the largest research challenge. In
order to develop scoring protocols, ...

practitioners list the characteristics of an appropriate solution to a particular problem and these judgments are
programmed into the ...

In cases where the expert system is unable to ...

panels of judges when two scores disagree), a master ...

Although the original target goal was 1997,³ the NCARB/ETS team ...
anticipated and, if progress continues at the same ...
examination system could be possible in 19..⁴



³Richard DeVore, senior examiner, Center for Occupational and Professional Assessment, Educational Testing Service, personal
communication, Oct. 15, 1991.

⁴Willittii Wleio H, "License Exams by Computer," Architectural Record, vol. 179, No. 7, July 1991, p. 80.



One of the greatest concerns with ITSs is that, like all testing
activities, they may gravitate toward promoting the skills that they
measure best. These skills tend to be algorithmic and routine. At the
same time, educators are concerned that we may not be focusing our
efforts on developing in students those thinking skills dependent on
complex knowledge. The skills required for understanding a written
passage, writing a composition, solving a problem that has many steps
and approaches, interpreting the results of an experiment, or
analyzing an argument are not so easily broken into discrete
components. Furthermore, attempts to segment these skills may result
in analysis that fails to capture the overall picture of what makes up
true competence. Creativity may be neither recognized nor rewarded in
existing ITS models.

Toward New Models of Assessment: 
Policy Issues 

A main finding of this chapter is the gap that separates current
applications of information technology in testing from a vision of
fundamental reform in the assessment of human learning and
educational achievement. In sum, computers and other data processing
equipment that have made possible a "mass production" testing
technology could become essential in the design and implementation of
new testing paradigms.

Computers and related technologies have proven indispensable to
research on human cognition, and lessons from this research are, in
turn, being applied, also with the help of sophisticated
computer-based systems, to the design of educational assessments that
correspond to the growing body of research on learning. The research
community, though still fragmented, has begun to coordinate the
efforts of cognitive theorists, computer scientists, subject matter
experts, and educators. These early efforts have led to particularly
promising breakthroughs in the application of technology to improved
classroom diagnosis and instructional feedback. Whether these efforts
will eventually also contribute to the creation of tests that can be
used for other functions, such as system monitoring or student
placement and certification, remains to be seen. In any event, it is
not clear that these latter functions of testing require the
diagnostic specificity of computer-based learning and assessment
tools. Overall, most experts would agree that applications of computer
technology to new forms of assessment are still at a very rudimentary
stage. The road ahead is a long one.

Research Support 

Policymakers face a formidable dilemma: reaching the as-yet uncharted
territory of new assessment models requires investments in
technologies that have uses in the current paradigm of testing and
that render that paradigm ever more efficient. Increased efficiency
encourages reliance on old models of testing. This problem is manifest
in the arena of funded research: much of the research on test theory
and new technology is funded by commercial test companies, which face
strong incentives to reinforce



Box 8-D— The Jasper Series: A Story is Worth a Thousand Questions 

The National Council of Teachers of Mathematics has suggested that the mathematics curriculum should:

. . . engage students in problems that demand extended effort to solve. Some might be group projects that require
students to use available technology and to engage ...
important criterion of problems ...¹

The Jasper Woodbury Problem Solving Series is a video-based adventure series designed to help students
develop their skills in solving mathematical problems.² Each of the six video segments is from 14 to 18 minutes
long and presents a dramatic adventure featuring Jasper ...

posed at the end of each segment to see how the story ends. (There is a solution shown on the video that students
see only after they have solved the problems themselves.) Although the problems are complex and require many
steps, all the data needed to solve the problems are contained ...

For example, the adventure "Rescue at Boone's Meadow" begins with Jasper's friend Larry flying his
ultralight airplane. Larry teaches Emily the principles she needs to know in order to fly solo in the plane: fuel
capacity, speed, payload limits, how the shape of the wing produces ...

she, Larry, and Jasper celebrate at a local restaurant. They discuss Jasper's upcoming fishing trip, and his plan to
hike 18 miles into the woods at Boone's Meadow. Details presented as a part of the unfolding ...
important data that students will later need to use in solving the problem. The next scene shows Jasper alone in the
deep woods, peacefully fishing, when a shot rings out ...

been seriously wounded, and radios for help on his CB radio. Emily receives his message, contacts a local
veterinarian, and is told that time is of the essence in rescuing the eagle. The story ends with Emily posing the
question: "What's the fastest way to rescue the eagle and how long will it take?" The students, no longer passive
watchers, have to put themselves in the role of Emily and solve the problem using data contained in the video.³

Researchers, working with teachers and students in 9 States, have found that students become extremely
engaged in the problem solving. Teaching strategies vary, but most teachers begin with large group activities
and then move into smaller cooperative learning groups, guiding the students to consider a variety of solutions. In
the episode summarized above, for example, if the students contemplate using the ultralight plane as a rescue
vehicle, they must take into account landing area, fuel consumption, payload limitations, speed, and other
information that can be reviewed by going back into the videodisc. Groups typically spend a minimum of two 1-hour
class periods working out their solution, and then must present and defend their plan to the entire class.

One of the research goals has been to create new ways to assess the learning that occurs in solving problems
presented in the series. One-on-one interviews with students were found to be much too time consuming.
Paper-and-pencil tests were developed, asking students to list and explain the kinds of subproblems that Jasper and
his friends needed to consider to solve each problem. Transfer problems, similar to the problems in the series but
involving new settings and data, were also given. Although the paper-and-pencil assessments showed that learning
occurred, there was one problem: teachers and students hated them! Teachers said: "My kids, as much as they liked
Jasper, as much as they begged for Jasper, finally told me: 'If I have to take another test on Jasper I don't want to
see another Jasper'"; or "It seems to me that we're really asking kids to do something strange when we've
introduced this wonderful technology and we've gotten them involved in the video experience. . . . Then you give
them this test that's on paper."⁴

How then should the students be tested? One approach has been to explore ways technology can be used in
the assessment process. In May of 1991 the researchers produced an experimental teleconference, the Challenge
Series, a game show format featuring three college students as contestants, each of whom claimed to be an expert



¹National Council of Teachers of Mathematics, Curriculum and Evaluation Standards for School Mathematics (Reston, VA: March
1989).

²The series is a research and development project of the Cognition and Technology Group at Vanderbilt University, supported by the
James S. McDonnell Foundation, the National Science Foundation, and Vanderbilt University.

³Cognition and Technology Group at Vanderbilt University, "The Jasper Experiment: An Exploration of Issues in Learning and
Instructional Design," July 26, 1991, p. 7 (forthcoming in Michael Hannafin and Simon Hooper (eds.), Educational Technology Research and
Development, special issue).

⁴Cognition and Technology Group at Vanderbilt University, "The Jasper Series: A Generative Approach to Improving Mathematical
Thinking," pp. 11-12 (forthcoming in American Association for the Advancement of Science, This Year in School Science).

Continued on next page
Box 8-D—The Jasper Series: A Story is Worth a Thousand Questions—Continued

on flight and on the Jasper adventure "Rescue at Boone's Meadow." While the contestants all answered questions
correctly on the first round, by the fourth round everyone except the true expert had made some erroneous
arguments. Would the students be fooled by antics, or could they identify the real expert? They called in their votes
and 85 percent of the students correctly identified the true expert. Enthusiasm for this form of "testing" was strong.

Other ideas building on the teleconference motif are being considered for each of the Jasper adventures. There
are also plans to help teachers engage in formative evaluations of student learning following each Jasper adventure
with video-based "what if" analogs like the ones used to prepare for the Challenge Series teleconference. Spinoff
vignettes that connect with other parts of the curriculum (e.g., an exploration of Lindbergh's historic flight from
New York to Paris) are also in progress. Finally, the researchers are designing a prototype set of computer-based
"students" or "tutees." The students must teach the "tutees" how to solve Jasper problems, and their progress
is tracked by the computer. This approach may be linked with the teleconferences. For example, the students could
teach computer-based tutees, who would then compete in a game show where the tutees become game show
contestants. The class that did the best job teaching its tutees wins.

The seven design principles underlying the Jasper Series, and their hypothesized benefits, are summarized in
table 8-D1.

Table 8-D1—Seven Design Principles Underlying the Jasper Adventure Series

Design principle
    Hypothesized benefits

1. Video-based format
    a. More motivating.
    b. Easier to search.
    c. Supports complex comprehension.
    d. Especially helpful for poor readers yet it can also support reading.

2. Narrative with realistic problems (rather than a lecture on video)
    a. Easier to remember.
    b. More engaging.
    c. Primes students to notice the relevance of mathematics and reasoning for everyday events.

3. Generative format (i.e., the stories end and students must generate the problems to be solved)
    a. Motivates students to determine the ending.
    b. Teaches students to find and define problems to be solved.
    c. Provides enhanced opportunities for reasoning.

4. Embedded data design (i.e., all the data needed to solve the problems are in the video)
    a. Permits reasoned decisionmaking.
    b. Motivates students to find.
    c. Puts students on an "even keel" with respect to relevant knowledge.
    d. Clarifies how relevance of data depends on specific goals.

5. Problem complexity (i.e., each adventure involves a problem of at least 14 steps)
    a. Overcomes the tendency to try for a few minutes and then give up.
    b. Introduces levels of complexity characteristic of real problems.
    c. Helps students deal with complexity.
    d. Develops confidence in abilities.

6. Pairs of related adventures
    a. Provides extra practice on core schema.
    b. Helps clarify what can be transferred and what cannot.
    c. Illustrates analogical thinking.

7. Links across the curriculum
    a. Helps extend mathematical thinking to other areas (e.g., history, science).
    b. Encourages the integration of knowledge.
    c. Supports information finding and publishing.

SOURCE: Cognition and Technology Group at Vanderbilt University, "The Jasper Experiment: An Exploration of
Issues in Learning and Instructional Design," July 26, 1991 (forthcoming in Michael Hannafin and Simon
Hooper (eds.), Educational Technology Research and Development, special issue).




the economic and educational advantages of the
conventional test paradigm. This is in contrast to the
test development process in other countries, which
is usually undertaken or supported wholly by the
government. Just how far the commercial research 
community will go in experimenting with nontradi- 
tional test designs, without external support, is 
uncertain. 

It is important to recall, however, that Federal
intervention frequently played a critical role in the
history of research, development, and implementa- 
tion of new testing technology: perhaps the best 
example is the Army testing program during World 
War I (see also ch. 3), which provided the most 
fertile ground imaginable for proving the feasibility 
of new forms of testing, such as group administra- 
tion, as well as statistical models based on normative 
comparisons and rankings. 

Indeed, the military has since then remained a 
major player in the development of personnel 
selection and placement tests, assessments of basic 
job skills, and experimentation with a variety of 
models of performance assessment. Some of these 
advances have spilled over into the civilian arena.^^ 
In addition, there is the more recent example of 
National Science Foundation (NSF) support for 
research leading to development of tasks used in the 
1988 National Assessment of Educational Progress
(NAEP) science assessment. Not only were these 
items viewed as important innovations in NAEP, but 
many of them were then adopted by New York State 
for its statewide fourth grade hands-on science 
assessment. Similarly, Department of Education 
funding for NAEP has supported research into 
constructed response items and innovative testing 
formats. Thus, while federally funded research on 
assessment has not been large, it has been an 
important complement to the large R&D projects 
financed privately — such as those by ETS, ACT, the 
National Council of Architectural Registration 
Boards, the National Board of Medical Examiners, 
and computer companies such as IBM and Apple — 
or financed by States and districts, such as in 
California and Portland, Oregon. 

The history of testing in the United States teaches 
that the Federal Government can be a catalyst for 



reform, through support for expansion of existing 
technologies and through support for basic research 
leading to new technologies. The Federal Govern- 
ment could continue to support basic research and 
applied development of a wide range of new models 
of testing. Specific options include: 

• earmarking resources in programs like Chapter 
1 for research into how advanced technologies 
can improve testing; 

• continuing to fund educational laboratories and 
centers for school-based research on
assessment;

• providing grants to independent researchers,
States, and school districts through NSF or 
other existing programs; 

• coordinating the efforts of the many research 
players both within and outside the Federal 
Government's research network, i.e., Federal 
laboratories, the National Diffusion Network, 
NSF Net, and Star School Programs in support 
of improvements in testing; and 

• supporting the exchange of data among the
many States and districts involved in pioneering
theoretical and practical research.

Infrastructure Support 

If computers, video, and telecommunications 
technologies are to play a significant role in
assessment, a combined "technology-push/market-pull"
strategy will be necessary.^*^ Technology-push in
this context focuses on the technology of software,
and is shorthand for software development support 
that could lead to increased demand for computer- 
based instructional and assessment systems in 
schools. The market-pull side of the equation refers
to direct investments in hardware: increasing the
installed base of technology in the schools could
lead to increased demand for good software, which
could in turn create improved economic incentives
for software developers and entrepreneurs. 

To make inroads in this interrelated system, the
Federal Government could support investments in 
CBT facilities that could be shared among schools 
within and across districts. This could entail
investments in communications technologies to link
hardware already in place, along with software and
training. Another approach would be for schools to 



^^The flow has gone in the other direction too: assessment techniques developed for educational institutions have been adopted by the military.

^See also Office of Technology Assessment, op. cit., footnote 16, for discussion of this approach to fostering improved instructional software
development.
lease their computer facilities to the Federal
Government for use in its large education and training
programs, or to other outside users (adult education, 
business, professional groups). The idea is to utilize 
the capacity of the hardware that exists in schools 
now, or the hardware that could be installed in the 
schools, during nonschool hours, and to reinvest the 
revenues in testing-related hardware or software 
technologies. Federal support for purchase of multi- 
purpose computer and video technologies for testing 
activities under existing Federal programs, such as 
Chapter 1, Magnet Schools, and Bilingual Education,
could build up the infrastructure of testing technolo- 
gies. 

Continuing Professional Development 
for Teachers 

Teachers are the most important link between
instructional or testing technologies and the students
whose achievement and progress those technologies 
are intended to affect. The problem is that few 
teachers have adequate preparation in the theory and 
techniques of assessment. This gap in teacher 
education is not limited to the arcana of
psychometrics, but extends even to the design and
interpretation of classroom-based tests.^^ At the same time,
many teachers have not yet come "online" with
computer use.^^ While teachers may be learning
about computers faster than about testing and
assessment, most teachers have not been exposed to
continuing professional development aimed at
helping them master the implications of matching
technology and new approaches to testing. 

Federal support for teacher development could 
have two benefit streams: first, it could result in
greater acceptance of new testing and assessment
technologies, which would in turn lead to
heightened demand for innovative software products; and
Teachers need help in learning to use teaching technology
for testing purposes.

second, it could involve teachers in the early stages 
of testing technology development, which could 
make the technologies that much more relevant. 

Leadership 

In 1990, the President and the Governors adopted
ambitious education goals to be met by the year 
2000, and there has been much discussion on 
developing new tests to measure success in meeting 
these goals. The Federal Government has the oppor- 
tunity to provide guidance in a time that has been 
marked by many suggestions for improvement and 
much accompanying confusion. Congress could 
take a leadership position in guiding, shaping, and 
supporting a vision of education that links learning 
with assessment in a rich, meaningful, engaging, and 
equitable fashion. 



^See, e.g., John R. Hills, "Apathy Concerning Grading and Testing," Phi Delta Kappan, vol. 72, No. 7, March 1991.
^See, e.g., Bank Street College, op. cit., footnote 44.
APPENDIX A

List of Acronyms

ACT — American College Testing Program
AERA — American Educational Research Association
AP — Advanced Placement
ASAP — Arizona Student Assessment Program
ASP — Accelerated Schools Program
CAP — California Assessment Program
CARAT — Computerized Adaptive Reporting and Testing
CAT — computerized adaptive testing
CBT — computer-based testing
CBX — Computer Based Exam
CEE — College Entrance Examinations
CES — Coalition of Essential Schools
COMP — College Outcome Measures Program
Compact — Connecticut Multi-State Performance Assessment Collaborative Teams
CR — constructed response
CRESST — Center for Research on Evaluation, Standards, and Student Testing
CRT — criterion-referenced test
CUES — Continuous Uniform Evaluation System
EC — European Community
EPT — English Placement Test
ERA — Education Reform Act (United Kingdom)
ESDs — Essential Skills Documents
ESEA — Elementary and Secondary Education Act
ESPET — Elementary Science Program Evaluation Test
ETS — Educational Testing Service
FERPA — Family Education Rights and Privacy Act
GCSE — General Certificate of Secondary Education
GED — General Educational Development
GKAP — Georgia Kindergarten Assessment Program
GPA — grade point average
GRE — Graduate Record Examination
HOTS — Higher Order Thinking Skills
ILS — integrated learning system
ITBS — Iowa Tests of Basic Skills
ITS — intelligent tutoring system
JFSAT — Joint First Stage Achievement Test
LAN — local area network
LEA — local education authority
LSAT — Law School Admissions Test
MAP — Monitoring Achievement in Pittsburgh
MCT — minimum competency testing
MOE — Ministry of Education
NAEP — National Assessment of Educational Progress
NAGB — National Assessment Governing Board
NBME — National Board of Medical Examiners
NCARB — National Council of Architectural Registration Boards
NCE — Normal Curve Equivalent
NCTM — National Council of Teachers of Mathematics
NIMSY — not-in-my-schoolyard
NRT — norm-referenced test
NSF — National Science Foundation
PALT — Portland (Oregon) Achievement Level Testing
PLR — Primary Language Record
R&D — research and development
SAT — Scholastic Aptitude Test
SAT — Standard Assessment Task
SEA — State education agency
SWESAT — Swedish Scholastic Aptitude Test
TAP — Tests of Achievement and Proficiency
TIERS — Title I/Chapter 1 Evaluation and Reporting System
TNCUEE — Test of the National Center for University Entrance Examinations
TSWE — Test of Standard Written English
VCR — video cassette recorder
APPENDIX B

Contractor Reports



Copies of contractor reports done for this project are available through the National Technical Information Service
(NTIS), either by mail (U.S. Department of Commerce, National Technical Information Service, Springfield, VA 22161)
or by calling NTIS directly at (703) 487-4650.

Douglas A. Archbald, University of Delaware, and Arnold C. Porter, University of Wisconsin, Madison, "A
Retrospective and an Analysis of Roles of Mandated Testing in Education Reform," PB 92-127596.

C.V. Bunderson, J.B. Olson, and A. Greenberg, The Institute for Computer Uses in Education, "Computers in
Educational Assessment: An Opportunity to Restructure Educational Practice," PB 92-127604.

Paul Burke, "You Can Lead Adolescents to a Test But You Can't Make Them Try," PB 92-127638.

Center for Children and Technology, Bank Street College, "Applications in Educational Assessment: Future
Technologies," PB 92-127588.

Nancy Kober, "The Role and Impact of Chapter 1, ESEA, Evaluation and Assessment Practices," PB
92-127646.

George F. Madaus, Boston College, and Thomas Kellaghan, St. Patricks College, Dublin, "Examination
Systems in the European Community: Implications for a National Examination System in the United States,"
PB 92-127570 (see also below).

Gail R. Meister, Research for Better Schools, "Assessment in Programs for Disadvantaged Students: Lessons
From Accelerated Schools," PB 92-127612.

Ruth Mitchell and Amy Stempel, Council for Basic Education, "Six Case Studies of Performance
Assessment," PB 92-127620.

Misuse of Tests, PB 92-127653 

1. Larry Cuban, Stanford University, "The Misuse of Tests in Education."

2. Robert L. Linn, University of Colorado at Boulder, "Test Misuse: Why Is It So Prevalent?"

3. Nelson L. Noggle, Center for the Advancement of Educational Practices, "The Misuse of Educational
Achievement Tests for Grades K-12: A Perspective."

♦ ♦ ♦

A copy of the contractor report listed below may be obtained by writing to the SET Program, Office of Technology
Assessment, U.S. Congress, Washington, DC 20510-8025; or by calling (202) 228-6920.

George F. Madaus, Boston College, and Thomas Kellaghan, St. Patricks College, Dublin, "Student
Examination Systems in the European Community: Lessons for the United States" (abridged version of
Madaus-Kellaghan paper listed above).
Ability tests, see aptitude tests
Ability-to-benefit tests, 37
Accelerated Schools Program, 46
Accountability, 4, 8, 17, 43, 53-54, 182, 194-195

computer applications, 266-267

Federal requirement, general, 44, 53-54, 56, 81, 95, 131

historical perspectives, 116

minimum competency testing and, 57 

performance assessments, 205, 209, 240 

State-level, 53-54, 182 

Title I, 34-36, 56, 81-82, 85, 88-89, 182, 220 

see also system monitoring 
Achievement tests, 13, 20, 85, 121, 166-179, 181 

v. aptitude tests, 168-169

cognitive processes and, 50 

design, 171-179 

historical perspectives, 104-105, 106-109, 116-118, 121, 124 
managerial uses, general, 180-181, 184, 194-197 
national testing proposals, 97 
school districts, score variation, 70, 74 
school districts, uses of, 34-36, 56, 81-82, 85, 88-89, 181, 182,
183, 184, 191, 182, 220
standards of achievement, 18
system monitoring, 11

use and interpretation, 166, 169, 179-186, 194-197

see also minimum competency testing; norm-referenced
testing; performance assessment
Adaptive testing, 203 

see also computer-adaptive testing 
Administration and administrators 

bureaucracy, 106, 107, 110, 184 

computer applications, 261 

corporate model, 115-116

education of, 72 

historical perspectives, 106, 107, 108 

international perspectives, 146-147, 149, 150, 152, 153-154, 

156, 157, 160 
Title I, 85, 87, 89

see also efficiency; test administration 
Admissions testing 

high school, 152, 154 

work-related skills and, 159

see also college entrance examinations 
Adults, 32 

see also parents 
Advanced placement tests, 170, 216, 236-237, 242, 247 
African Americans, 93, 124 
Age factors 

historical perspectives, 106, 112, 117-118 

international perspectives, 98, 135, 136, 142, 143 

intelligence testing, 112-113 

minimum competency testing, 65, 186

NAEP, 93 

national testing, 30, 31, 98, 135, 136, 142, 143 
performance assessments, 202 
Alaska, 230 

America 2000 Excellence in Education Act, 29, 95, 9t 
American College Testing program, 68n, 185, 192, 218, 279
computer applications, 259
American Federation of Teachers, 225



American Psychological Association, 259 
Americans With Disabilities Act, 37 
Analytical scoring, 241-242 
Anastasi, Anne, 168-169, 170n
Appropriate test use, general, 36-37, 67-77 
Aptitude testing 

ability-to-benefit tests, 37 

V. achievement testing, 168-169 

historical perspectives, 121, 130 

national testing proposals, 97, 130 

see also intelligence testing 
Arendt, Hannah, 105 
Architects, certification, 272, 274-276
Arizona, 206, 207-208, 248 
Arkansas, 201 

Armed Services Vocational Aptitude Battery, 38
Art and art appreciation 

music studies, 93 

NAEP, 93 

portfolios, 230, 236-237, 242 
Artificial intelligence, 20, 214, 273

see also expert systems 
Arts PROPEL, 230 
Attitudes 

student self-esteem, 66, 85, 139 

teacher attitudes toward tests, 182, 184, 207, 208, 217,
220-221, 225, 2?2

teacher expectations for students, 67, 122, 183, 190, 217, 248

see also motivational factors; public opinion
Audio-visual aids, see video techniques
Authentic testing, 13, 189, 202
Ayres, Leonard, 116-118

Basic skills, 17, 43, 64-65

v. higher order skills, 30, 45, 46, 48, 64, 66

historical perspectives, 111

learning processes, 46

worker-related, 30, 32, 48, 50, 159, 235

see also minimum competency testing
Bias, see fairness

Binet, Alfred, 112-113, 114, 118, 171-172, 254

Bloom's taxonomy, 51

Brigham, Carl, 120, 124

Brown v. Board of Education, 130 

Buckley Amendment, see Family Education Rights and Privacy 
Act 

Bureaucracy, 106, 107, 110, 184 

Bureau of Education, 122 

Buros Institute of Mental Measurement, 71-72

California, 76, 98, 118
performance assessments, 202, 206, 209, 210-211, 230, 241,
246-247, 248

California Achievement Tests, 206

California Assessment Program, 210-211, 217, 241

Canada, 70n

Carnegie Corp., 90

Carnegie Foundation, 199
Cattell, J. McKeen, 112
CD-ROM technology, 255, 271

Center for Research on Evaluation, Standards, and Student
Testing, 39, 248
Certification, see credentialing; licensing (professional)
Chapter 1, see Title I

China, People's Republic of, 138, 148-150

Civil Rights Act, 55

Classification

error analysis and, 176 

standardized tests, 170 

tests, 166 

see also selection and placement; tracking
Classroom management, 267

see also integrated learning systems
Coaching, see teaching to the test 
Coalition of Essential Schools, 270-271 
Code of Fair Testing Practices, 36, 70 
Cognitive processes, 48-53 

artificial intelligence, 20, 214, 273 

computer testing, 266, 268-269 

constructed response items, general, 231, 235 

conventional testing of, 6-7 

curricula and, 48, 53 

innovative test technologies and, 20, 22, 214, 273 
research, 39, 43, 45, 46, 51-53, 268 
see also higher order thinking skills; learning processes; 
problem solving 
College Entrance Examination Board, 119, 125-126 

advanced placement tests, 170, 216, 236-237, 242, 247 
College entrance examinations, 13 
American College Testing program, 68n, 185, 192, 218, 259,
279 

higher order thinking skills, 125

historical perspectives, 118-119, 125-126, 137 

international perspectives, 137-138, 149, 150, 151, 152-153, 

155-156, 159, 162 
legislation, 76n 
misuse, 15-16, 185 
multiple-choice items, 192 
oral, 218 

portfolios, 236-237 

secondary education, effects on, 125-126 
test-takers involvement in development, 68n
see also Scholastic Aptitude Test
College Outcome Measures Program, 218
Colleges and universities, see universities and colleges
Colorado, 58 

Compact, 225 

Commercially developed tests, 56, 87, 166, 184, 207, 276, 279
computer applications, 261 
copyright, 74 
design flaws, 18 
historical perspectives, 129-130 
international perspectives, 147-148 
legislation on, 75, 76 

norm-referenced tests, 87, 92, 169, 170, 171, 207, 208, 232 
R&D, 38, 279 
revenues, 3, 4, 44
Compensatory education, 181 

Federal support, accountability, 56 
Head Start, 88 

learning models, 46
see also Title I
Compulsory education, 149, 151, 157 
Computer-adaptive testing, 20, 22, 2^26, 253, 259-261, 265, 
266 

Computer-administered testing, 18, 22, 25, 174, 253, 257-276, 
279-280 
accountability and, 266-267 
algorithms, 256, 266, 276 
artificial intelligence, 20, 214, 273 
CD-ROM technology, 255, 271 
cognitive processes, 266, 268-269 
commercial, 261 
computer literacy, 253, 265-266 
constructed-response items, 261

cost factors, 25, 127, 261, 262-264, 266, 267, 268, 274-276
defined, 5 

efficiency, 256, 261, 264-265, 268 
errors, 22, 214, 253, 264-266 
ETS, 259, 266, 274-276 
expert systems, 269, 273, 276 
funding, 270, 276, 279 
GRE, 259, 266

high-stakes testing, 24-25, 266-267
instructional feedback, general, 22, 260, 265, 266, 269
integrated learning systems, 20, 21, 48, 257-258, 265, 267,
273

language assessment, 22, 261

licensing, professionals, 26, 261, 262-264, 267, 274-276
mastery testing, general, 261
mathematics, 256, 265, 272, 273
networks, 257-258, 269
norm-referenced testing, 259

v. paper-and-pencil tests, 259, 261, 264, 265, 268, 272, 274

performance assessment, 20, 201, 211, 213, 237, 240 

policy issues, general, 276-280 

portfolios, 271 

privacy, 266-267 

problem solving, 269, 272, 273 

reliability, 264, 266, 275

R&D, 20, 253, 264, 265, 267, 268, 270-271, 274-276, 279
science tests, 265, 272, 273
secrecy and security, 260, 265
selection and placement, general, 257, 260
simulations, 262-264, 272, 274-276 
statistical applications, general, 255^ 256 
system monitoring, 25, 259 
writing, 22, 269-270, 272 
Computers and computer science, 50, 253-280 
artificial intelligence, 20, 214, 273 
CD-ROM technology, 255, 271 
expert systems, 269, 273, 276 
historical perspectives, 127, 130, 255 
integrated learning systems, 20, 21, 48, 257-258, 265, 267, 
273 

item-response theory, 257, 259-260 

optical storage techniques, electronics, 267 

test design, 18, 20, 255-256, 261 

see also databases; machine scoring 
Computer software, 256, 257, 258, 266, 267, 269, 272 
Confidentiality and consent, 36-37 

Buckley Amendment, 36, 75-76, 81 

computer applications, 266-267 
standards, 70-71, 76
Congress of the United States 

antitesting movement, 67 

minimum competency testing, 16

Title I evaluation data, 87

see also legislation, specific Federal; policy issues
Connecticut, 206, 224-225, 228 
Consistency, see reliability 

Constructed-response items, 18, 19, 141, 201, 206, 208, 209,
211, 213-216, 231, 235
computer applications, 261 
defined, 5, 206 

diagnostic testing, 208, 215-216 

higher order skills, 208-209, 215-216 

machine scoring of, 201, 213-215 

mathematical, 209, 211, 212, 213n, 215

performance assessments, 5, 18, 19, 141, 201, 206, 208, 209,
211, 213-216, 231, 235

problem solving, 208, 213, 215, 231, 234 

validity, 206, 208 
Construct validity, 178, 179 
Content validity, 60, 177-178, 184, 189 
Continuous Uniform Evaluation System, 207 
Copyright, 74 

Cost factors, 27-29, 30, 55, 267 
computer technology, 25, 127, 261, 262-264, 266, 267, 268, 
274-276 

international perspectives, 141-142, 146 
machine scoring, 254 
multiple-choice tests, 141-142, 146, 243 
national testing proposals, 98-99 

performance assessments, 201, 208, 209, 211, 216-217, 218, 
236, 243-245 

sampling techniques, 23 

system monitoring, 1 1 

test development, 37-39, 141, 142 

urban areas, 28-29

see also efficiency 
Court cases, see litigation 
Credentialing, 11-12, 170-171, 180-181, 182, 196

computer applications, 257 

fairness issues, 12, 26, 185 

innovative technologies, 25-26 

international perspectives, 141, 142-143, 144-145, 191

misuse of tests, 37

national testing proposals, 12, 97

validation, 196 

see also licensing (professional) 
Criterion-referenced testing, 14, 28, 87 
ability-to-benefit tests, 37
military testing, 120 
NAEP, 92 

performance assessments, 18, 20, 238 

State-level issues, 14, 15, 59, 86, 186-187 

test use, 168-170, 186 

validity issues, 176-177, 178, 179 

see also minimum competency testing
Cubberley, Ellwood P., 116
Cultural factors, 76

Americanization, 105-106, 111, 114, 121, 129n, 131

international perspectives, 136, 151, 154
performance assessment, 24, 246-247



pluralism, 104, 120, 147 
symbolic function of testing, 54, 120-121, 131 
Curriculum
cognitive processes, 48, 53 
English language, 203, 213, 218-219 
Federal involvement, 82, 91, 92, 94-95 
goals, 18, 23, 92, 147 
higher order thinking, 203
historical perspectives, 118, 121 

international perspectives, 138-139, 146-147, 152, 154, 160 

mathematics, 48 

reform of, 4, 7, 44-45, 48, 50 

research, 45 

standardized tests v. local curricula, 165, 168, 178 
standards, 48, 50 

State control, 60-61, 94-95, 207, 210 
validity and, 178 
Curriculum alignment, 50, 60-61, 65, 161-162, 184, 197 
defined, 60 

Lake Wobegon effect and, 63 
NAEP, 33 

Curriculum and Evaluation Standards for School Mathematics,
188 

Cutoff scores, 57, 59, 84, 87, 150, 171, 176, 181, 187, 196, 237 

Databases 

item banks, 24-25, 191, 255-256 

national, 39 
Debra P. v. Turlington, 73
Demographic factors 

gender, 202, 246 

high-stakes testing, 54-55 

historical perspectives, 105-108, 109, 110, 114, 117, 120, 121, 